PREFACE


INTRODUCTION 

This book was written to serve the needs of practicing engineers and computer
scientists, and for students from a variety of backgrounds (computer science
and engineering, electrical engineering, mathematics, operations research, and
other disciplines) taking college- or professional-level courses. The field of
high-reliability, high-availability, fault-tolerant computing was developed for
the critical needs of military and space applications. NASA deep-space missions
are costly, for they require various redundancy and recovery schemes to
avoid total failure. Advances in military aircraft design led to the development
of electronic flight controls, and similar systems were later incorporated in the
Airbus 330 and Boeing 777 passenger aircraft, where flight controls are triplicated
to permit some elements to fail during aircraft operation. The reputation
of the Tandem business computer is built on NonStop computing, a comprehensive
redundancy scheme that improves reliability. Modern computer storage
uses redundant array of independent disks (RAID) techniques to link 50-100
disks in a fast, reliable system. Various ideas arising from fault-tolerant computing
are now used in nearly all commercial, military, and space computer
systems; in the transportation, health, and entertainment industries; in institutions
of education and government; in telephone systems; and in both fossil and
nuclear power plants. Rapid developments in microelectronics have led to very
complex designs; for example, a luxury automobile may have 30-40 microprocessors
connected by a local area network! Such designs must be made using
fault-tolerant techniques to provide significant software and hardware reliability,
availability, and safety.


Computer networks are currently of great interest, and their successful operation
requires a high degree of reliability and availability. This reliability is
achieved by means of multiple connecting paths among locations within a network
so that when one path fails, transmission is successfully rerouted. Thus
the network topology provides a complex structure of redundant paths that, in
turn, provide fault tolerance, and these principles also apply to power distribution,
telephone and water systems, and other networks.

Fault-tolerant computing is a generic term describing redundant design techniques
with duplicate components or repeated computations enabling uninterrupted
(tolerant) operation in response to component failure (faults). Sometimes,
system disasters are caused by neglecting the principles of redundancy
and failure independence, which are obvious in retrospect. After the September
11th, 2001, attack on the World Trade Center, it was revealed that although one
company had maintained its primary system database in one of the twin towers,
it wisely had kept its backup copies at its Denver, Colorado office. Another
company had also maintained its primary system database in one tower but,
unfortunately, kept its backup copies in the other tower.


COVERAGE 

Much has been written on the subject of reliability and availability since
its development in the early 1950s. Fault-tolerant computing began between
1965 and 1970, probably with the highly reliable and widely available AT&T
electronic-switching systems. Starting with first principles, this book develops
reliability and availability prediction and optimization methods and applies
these techniques to a selection of fault-tolerant systems. Error-detecting and
-correcting codes are developed, and an analysis is made of the probability
that such codes might fail. The reliability and availability of parallel, standby,
and voting systems are analyzed and compared, and such analyses are also
applied to modern RAID memory systems and commercial Tandem and Stratus
fault-tolerant computers. These principles are also used to analyze the primary
avionics software system (PASS) and the backup flight control system (BFS)
used on the Space Shuttle. Errors in the software that controls modern digital
systems can cause system failures; thus a chapter is devoted to software reliability
models. Also, the use of software redundancy in the BFS is analyzed.

Computer networks are fundamental to communications systems, and local
area networks connect a wide range of digital systems. Therefore, the principles
of reliability and availability analysis for computer networks are developed,
culminating in an introduction to network design principles. The concluding
chapter considers a large system with multiple possibilities for improving reliability
by adding parallel or standby subsystems. Simple apportionment and
optimization techniques are developed for designing the highest reliability system
within a fixed cost budget.

Four appendices are included to serve the needs of a variety of practitioners
and students: Appendices A and B, covering probability and reliability principles
for readers needing a review of probabilistic analysis; Appendix C, covering
architecture for readers lacking a computer engineering or computer science
background; and Appendix D, covering reliability and availability modeling
programs for large systems.


USE AS A REFERENCE 

Often, a practitioner is faced with an initial system design that does not meet
reliability or availability specifications, and the techniques discussed in Chapters
3, 4, and 7 help a designer rapidly evaluate and compare the reliability and
availability gains provided by various improvement techniques. A designer or
system engineer lacking a background in reliability will find the book’s development
from first principles in the chapters, the appendices, and the exercises
ideal for self-study or intensive courses and seminars on reliability and availability.
Intuition and quick analysis of proposed designs generally direct the
engineer to a successful system; however, the efficient optimization techniques
discussed in Chapter 7 can quickly yield an optimum solution and a range of
good suboptima.

An engineer faced with newly developed technologies needs to consult the
research literature and other more specialized texts; the many references provided
can aid such a search. Topics of great importance are the error-correcting
codes discussed in Chapter 2, the software reliability models discussed in
Chapter 5, and the network reliability discussed in Chapter 6. Related examples
and analyses are distributed among several chapters, and the index helps
the reader to trace the evolution of an example.

Generally, the reliability and availability of large systems are calculated
using fault-tolerant computer programs. Most industrial environments have
these programs, the features of which are discussed in Appendix D. The most
effective approach is to preface a computer model with a simplified analytical
model, check the results, study the sensitivity to parameter changes, and
provide insight if improvements are necessary.


USE AS A TEXTBOOK 

Many books that discuss fault-tolerant computing have a broad coverage of
topics, with individual chapters contributed by authors of diverse backgrounds
using different notations and approaches. This book selects the most important
fault-tolerant techniques and examples and develops the concepts from first
principles by using a consistent notation and analytical approach, with probabilistic
analysis as the unifying concept linking the chapters.

To use this book as a teaching text, one might: (a) cover the material
sequentially, in the order of Chapter 1 to Chapter 7; (b) preface approach
(a) by reviewing probability; or (c) begin with Chapter 7 on optimization and
cover Chapters 3 and 4 on parallel, standby, and voting reliability; then augment
by selecting from the remaining chapters. The sequential approach of (a)
covers all topics and increases the analytical level as the course progresses;
it can be considered a bottom-up approach. For a college junior- or
senior-undergraduate-level or introductory graduate-level course, an instructor might
choose approach (b); for an experienced graduate-level course, an instructor
might choose approach (c). The homework problems at the end of each chapter
are useful for self-study or classroom assignments.

At Polytechnic University, fault-tolerant computing is taught as a one-term 
graduate course for computer science and computer engineering students at the 
master’s degree level, although the course is offered as an elective to senior- 
undergraduate students with a strong aptitude in the subject. Some consider 
fault-tolerant computing as a computer-systems course; others, as a second 
course in architecture. 


ACKNOWLEDGMENTS 

The author thanks Carol Walsh and Joann McDonald for their help in preparing
the class notes that preceded this book; the anonymous reviewers for their
useful suggestions; and Professor Joanne Bechta Dugan of the University of
Virginia and Dr. Robert Swarz of MITRE Corporation (Bedford, Massachusetts)
and Worcester Polytechnic for their extensive, very helpful comments. He is
grateful also to Wiley editors Dr. Philip Meyler and Andrew Prince who provided
valuable advice. Many thanks are due to Dr. Alan P. Wood of Compaq
Corporation for providing detailed information on Tandem computer design,
discussed in Chapter 3, and to Larry Sherman of Stratus Computers for detailed
information on Stratus, also discussed in Chapter 3. Sincere thanks are due to
Sylvia Shooman, the author’s wife, for her support during the writing of this
book; she helped at many stages to polish and improve the author’s prose and
diligently proofread with him.


Glen Cove, NY 
November 2001 


Martin L. Shooman 



1 INTRODUCTION


The central theme of this book is the use of reliability and availability computations
as a means of comparing fault-tolerant designs. This chapter defines
fault-tolerant computer systems and illustrates the prime importance of such
techniques in improving the reliability and availability of digital systems that
are ubiquitous in the 21st century. The main impetus for complex, digital systems
is the microelectronics revolution, which provides engineers and scientists
with inexpensive and powerful microprocessors, memories, storage systems,
and communication links. Many complex digital systems serve us in
areas requiring high reliability, availability, and safety, such as control of air
traffic, aircraft, nuclear reactors, and space systems. However, it is likely that
planners of financial transaction systems, telephone and other communication
systems, computer networks, the Internet, military systems, office and home
computers, and even home appliances would argue that fault tolerance is necessary
in their systems as well. The concluding section of this chapter explains
how the chapters and appendices of this book interrelate.


1.1 WHAT IS FAULT-TOLERANT COMPUTING? 

Literally, fault-tolerant computing means computing correctly despite the existence
of errors in a system. Basically, any system containing redundant components
or functions has some of the properties of fault tolerance. A desktop
computer and a notebook computer loaded with the same software and with
files stored on floppy disks or other media are an example of a redundant system.
Since either computer can be used, the pair is tolerant of most hardware
and some software failures.

The sophistication and power of modern digital systems give rise to a host
of possible sophisticated approaches to fault tolerance, some of which are as 
effective as they are complex. Some of these techniques have their origin in 
the analog system technology of the 1940s-1960s; however, digital technology 
generally allows the implementation of the techniques to be faster, better, and 
cheaper. Siewiorek [1992] cites four other reasons for an increasing need for 
fault tolerance: harsher environments, novice users, increasing repair costs, and 
larger systems. One might also point out that the ubiquitous computer system 
is at present so taken for granted that operators often have few clues on how 
to cope if the system should go down. 

Many books cover the architecture of fault tolerance (the way a fault-tolerant 
system is organized). However, there is a need to cover the techniques required 
to analyze the reliability and availability of fault-tolerant systems. A proper 
comparison of fault-tolerant designs requires a trade-off among cost, weight, 
volume, reliability, and availability. The mathematical underpinnings of these 
analyses are probability theory, reliability theory, component failure rates, and 
component failure density functions. 

The obvious technique for adding redundancy to a system is to provide a 
duplicate (backup) system that can assume processing if the operating (on-line) 
system fails. If the two systems operate continuously (sometimes called hot 
redundancy), then either system can fail first. However, if the backup system 
is powered down (sometimes called cold redundancy or standby redundancy), 
it cannot fail until the on-line system fails and it is powered up and takes over. 
A standby system is more reliable (i.e., it has a smaller probability of failure); 
however, it is more complex because it is harder to deal with synchronization 
and switching transients. Sometimes the standby element does have a small 
probability of failure even when it is not powered up. One can further enhance 
the reliability of a duplicate system by providing repair for the failed system. 
The average time to repair is much shorter than the average time to failure. 
Thus, the system will only go down in the rare case where the first system fails 
and the backup system, when placed in operation, experiences a short time to 
failure before an unusually long repair on the first system is completed. 
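
The comparison between hot and cold (standby) redundancy can be made concrete with a short calculation. The following is only a sketch under assumptions that are not stated in this section: each unit has a constant failure rate lam (exponential time to failure), the units fail independently, the switch to the standby unit is perfect, the powered-down unit cannot fail while idle, and no repair is performed. The function names are chosen here for illustration.

```python
import math

def r_single(lam, t):
    """Reliability of one unit with constant failure rate lam over time t."""
    return math.exp(-lam * t)

def r_hot_pair(lam, t):
    """Two units running in parallel: the pair fails only if both fail."""
    q = 1.0 - r_single(lam, t)
    return 1.0 - q * q

def r_cold_standby(lam, t):
    """One unit on-line plus an unpowered spare with perfect switchover."""
    return math.exp(-lam * t) * (1.0 + lam * t)

if __name__ == "__main__":
    lam = 1e-3  # failures per hour (illustrative value, not from the text)
    for t in (100, 500, 1000):
        print(t, round(r_single(lam, t), 4),
              round(r_hot_pair(lam, t), 4),
              round(r_cold_standby(lam, t), 4))
```

Under these idealized assumptions the standby pair is always at least as reliable as the hot pair, which in turn beats a single unit; an imperfect switch or a nonzero standby failure rate would narrow the gap.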

Failure detection is often a difficult task; however, a simple scheme called 
a voting system is frequently used to simplify such detection. If three systems 
operate in parallel, the outputs can be compared by a voter, a digital comparator 
whose output agrees with the majority output. Such a system succeeds if all
three systems or two of the three systems work properly. A voting system can
be made even more reliable if repair is added for a failed system once a single 
failure occurs. 
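
As a minimal sketch of the voting idea, the code below implements a majority voter and the probability that at least two of three systems work. It assumes three identical, independent systems, each working with probability r, and a voter that never fails; these assumptions and the function names are introduced here for illustration only.

```python
from collections import Counter

def majority_vote(outputs):
    """Return the value reported by a strict majority, or None if there is none."""
    value, count = Counter(outputs).most_common(1)[0]
    return value if count > len(outputs) / 2 else None

def r_two_of_three(r):
    """Probability that all three, or exactly two of three, systems work."""
    return r**3 + 3 * r**2 * (1 - r)

if __name__ == "__main__":
    print(majority_vote([1, 1, 0]))      # one faulty output is outvoted
    print(round(r_two_of_three(0.9), 3))
```

With r = 0.9 the voting arrangement gives 0.972, better than a single system; note that for r below 0.5 the same formula gives a result worse than a single system.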

Modern computer systems often evolve into networks because of the flexible 
way computer and data storage resources can be shared among many users. 
Most networks either are built or evolve into topologies with multiple paths 
between nodes; the Internet is the largest and most complex model we all use.
If a network link fails and breaks a path, the message can be routed via one or
more alternate paths maintaining a connection. Thus, the redundancy involves 
alternate paths in the network. 
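
The effect of alternate paths can be illustrated with a small connectivity check. This is a sketch on a hypothetical four-node ring (the node names and topology are invented for the example): a breadth-first search tests whether two nodes can still communicate after one or more links are removed.

```python
from collections import deque

def connected(links, src, dst):
    """Breadth-first search: True if dst is reachable from src over the links."""
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {src}, deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

if __name__ == "__main__":
    ring = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]
    one_failure = [link for link in ring if link != ("A", "B")]
    two_failures = [link for link in one_failure if link != ("D", "A")]
    print(connected(ring, "A", "C"))          # both paths available
    print(connected(one_failure, "A", "C"))   # rerouted via D
    print(connected(two_failures, "A", "C"))  # A is isolated
```

A ring tolerates any single link failure because every pair of nodes has two disjoint paths; a second failure can partition it.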

In both of the above cases, the redundancy penalty is the presence of extra
systems with their concomitant cost, weight, and volume. When the transmission
of signals is involved in a communications system, in a network, or
between sections within a computer, another redundancy scheme is sometimes
used. The technique is not to use duplicate equipment but increased transmission
time to achieve redundancy. To guard against undetected, corrupting transmission
noise, a signal can be transmitted two or three times. With two transmissions
the bits can be compared, and a disagreement represents a detected
error. If there are three transmissions, we can essentially vote with the majority,
thus detecting and correcting an error. Such techniques are called error-detecting
and error-correcting codes, but they decrease the transmission speed by
a factor of two or three. More efficient schemes are available that add extra
bits to each transmission for error detection or correction and also increase
transmission reliability with a much smaller speed-reduction penalty.
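
The repeated-transmission schemes just described can be sketched in a few lines. The functions below are illustrative names, not taken from the text: one compares two copies of a message to detect errors, one votes bit by bit among three copies to correct a single corrupted copy, and the last computes a single even-parity bit as the simplest example of adding an extra bit instead of repeating the whole message.

```python
def compare_two(first, second):
    """Two transmissions: positions where the copies disagree (detected errors)."""
    return [i for i, (x, y) in enumerate(zip(first, second)) if x != y]

def vote_three(first, second, third):
    """Three transmissions: take the bitwise majority to correct errors."""
    return [1 if x + y + z >= 2 else 0 for x, y, z in zip(first, second, third)]

def even_parity_bit(bits):
    """Extra bit that makes the total number of 1s even."""
    return sum(bits) % 2

if __name__ == "__main__":
    sent = [1, 0, 1, 1]
    corrupted = [1, 1, 1, 1]                  # bit 1 flipped by noise
    print(compare_two(sent, corrupted))       # error detected at position 1
    print(vote_three(sent, corrupted, sent))  # majority restores the message
    print(even_parity_bit(sent))
```

Two copies can only detect a disagreement, since there is no way to tell which copy is wrong; three copies can correct it as long as the same bit is not corrupted in two of them. A single parity bit detects any odd number of bit errors at the cost of one extra bit rather than a doubling of transmission time.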

The above schemes apply to digital hardware; however, many of the reliability
problems in modern systems involve software errors. Modeling the number
of software errors and the frequency with which they cause system failures
requires approaches that differ from hardware reliability. Thus, software reliability
theory must be developed to compute the probability that a software
error might cause system failure. Software is made more reliable by testing to
find and remove errors, thereby lowering the error probability. In some cases,
one can develop two or more independent software programs that accomplish
the same goal in different ways and can be used as redundant programs. The
meaning of independent software, how it is achieved, and how partial software
dependencies reduce the effects of redundancy are studied in Chapter 5, which
discusses software.

Fault-tolerant design involves more than just reliable hardware and software.
System design is also involved, as evidenced by the following personal examples.
Before a departing flight I wished to change the date of my return, but the
reservation computer was down. The agent knew that my new return flight was
seldom crowded, so she wrote down the relevant information and promised to
enter the change when the computer system was restored. I was advised to confirm
the change with the airline upon arrival, which I did. Was such a procedure
part of the system requirements? If not, it certainly should have been.

Compare the above example with a recent experience in trying to purchase
tickets by phone for a concert in Philadelphia 16 days in advance. On my
Monday call I was told that the computer was down that day and that nothing
could be done. On my Tuesday and Wednesday calls I was told that the computer
was still down for an upgrade, and so it took a week for me to receive
a call back with an offer of tickets. How difficult would it have been to print
out from memory files seating plans that showed seats left for the next week
so that tickets could be sold from the seating plans? Many problems can be
avoided at little cost if careful plans are made in advance. The planners must
always think “what do we do if ...?” rather than “it will never happen.”

This discussion has focused on system reliability: the probability that the 
system never fails in some time interval. For many systems, it is acceptable
to go down for short periods if this happens infrequently. In such cases, which
involve repair, the system availability is computed instead. A system is said
to be highly available if there is a low probability that it will be down
at any instant of time. Although reliability is the more stringent measure, both
reliability and availability play important roles in the evaluation of systems. 
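
The distinction between the two measures can be shown numerically. The sketch below assumes a constant failure rate (so reliability over an interval t is exp(-t/MTTF)) and uses the common steady-state expression MTTF/(MTTF + MTTR) for availability; neither formula is given in this section, and the numbers are illustrative only.

```python
import math

def reliability(mttf_hours, t_hours):
    """Probability of no failure at all during t_hours (constant failure rate)."""
    return math.exp(-t_hours / mttf_hours)

def steady_state_availability(mttf_hours, mttr_hours):
    """Long-run fraction of time the system is up when failures are repaired."""
    return mttf_hours / (mttf_hours + mttr_hours)

if __name__ == "__main__":
    mttf, mttr = 999.0, 1.0  # mean time to failure / to repair (assumed values)
    print(round(reliability(mttf, 1000.0), 3))
    print(round(steady_state_availability(mttf, mttr), 3))
```

With these numbers the system is up 99.9% of the time, yet the chance of surviving 1,000 hours with no failure at all is only about 37%, which is why reliability is the more stringent measure.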

1.2 THE RISE OF MICROELECTRONICS AND THE COMPUTER 
1.2.1 A Technology Timeline 

The rapid rise in the complexity of tasks, hardware, and software is why fault
tolerance is now so important in many areas of design. The rise in complexity
has been fueled by the tremendous advances in electrical and computer technology
over the last 100-125 years. The low cost, small size, and low power
consumption of microelectronics and especially digital electronics allow practical
systems of tremendous sophistication but with concomitant hardware and
software complexity. Similarly, the progress in storage systems and computer
networks has led to the rapid growth of networks and systems.

A timeline of the progress in electronics is shown in Shooman [1990, Table
K-1]. The starting point is the 1874 discovery that the contact between a metal
wire and the mineral galena was a rectifier. Progress continued with the vacuum
diode and triode in 1904 and 1905. Electronics developed for almost a half-century
based on the vacuum tube and included AM radio, transatlantic radiotelephony,
FM radio, television, and radar. The field began to change rapidly after
the discovery of the point contact and field effect transistor in 1947 and 1949
and, ten years later in 1959, the integrated circuit.

The rise of the computer occurred over a time span similar to that of microelectronics,
but the more significant events occurred in the latter half of the
20th century. One can begin with the invention of the punched card tabulating
machine in 1889. The first analog computer, the mechanical differential analyzer,
was completed in 1931 at MIT, and analog computation was enhanced by
the invention of the operational amplifier in 1938. The first digital computers
were electromechanical; included are the Bell Labs’ relay computer (1937-40),
the Z1, Z2, and Z3 computers in Germany (1938-41), and the Mark I completed
at Harvard with IBM support (1937-44). The ENIAC developed at the
University of Pennsylvania between 1942 and 1945 with U.S. Army support
is generally recognized as the first electronic computer; it used vacuum tubes.
Major theoretical developments were the general mathematical model of computation
by Alan Turing in 1936 and the stored program concept of computing
published by John von Neumann in 1946. The next hardware innovations were
in the storage field: the magnetic-core memory in 1950 and the disk drive
in 1956. Electronic integrated circuit memory came later in 1975. Software
improved greatly with the development of high-level languages: FORTRAN
(1954-58), ALGOL (1955-56), COBOL (1959-60), PASCAL (1971), the C
language (1973), and the Ada language (1975-80). For computer advances
related to cryptography, see problem 1.25.

The earliest major computer systems were the U.S. Air Force SAGE air
defense system (1955), the American Airlines SABRE reservations system
(1957-64), the first time-sharing systems at Dartmouth using the BASIC language
(1966) and the MULTICS system at MIT written in the PL-I language
(1965-70), and the first computer network, the ARPAnet, that began in 1969.
The concept of RAID fault-tolerant memory storage systems was first published
in 1988. The major developments in operating system software were
the UNIX operating system (1969-70), the CP/M operating system for the 8086
Microprocessor (1980), and the MS-DOS operating system (1981). The choice
of MS-DOS to be the operating system for IBM’s PC, and Bill Gates’ fledgling
company as the developer, led to the rapid development of Microsoft.

The first home computer design was the Mark-8 (Intel 8008 Microprocessor),
published in Radio-Electronics magazine in 1974, followed by the Altair
personal computer kit in 1975. Many of the giants of the personal computing
field began their careers as teenagers by building Altair kits and programming
them. The company then called Micro Soft was founded in 1975 when Gates
wrote a BASIC interpreter for the Altair computer. Early commercial personal
computers such as the Apple II, the Commodore PET, and the Radio Shack
TRS-80, all marketed in 1977, were soon eclipsed by the IBM PC in 1981.
Early widely distributed PC software began to appear in 1978 with the WordStar
word processing system, the VisiCalc spreadsheet program in 1979, early
versions of the Windows operating system in 1985, and the first version of the
Office business software in 1989. For more details on the historical development
of microelectronics and computers in the 20th century, see the following
sources: Ditlea [1984], Randall [1975], Sammet [1969], and Shooman [1983].
Also see www.intel.com and www.microsoft.com.

This historical development leads us to the conclusion that today one can
build a very powerful computer for a few hundred dollars with a handful of
memory chips, a microprocessor, a power supply, and the appropriate input,
output, and storage devices. The accelerating pace of development is breathtaking,
and of course all the computer memory will be filled with software
that is also increasing in size and complexity. The rapid development of the
microprocessor, in many ways the heart of modern computer progress, is
outlined in the next section.

1.2.2 Moore’s Law of Microprocessor Growth 

The growth of microelectronics is generally identified with the growth of
the microprocessor, which is frequently described as “Moore’s Law” [Mann,
2000]. In 1965, Electronics magazine asked Gordon Moore, research director
of Fairchild Semiconductor, to predict the future of the microchip industry.
From the chronology in Table 1.1, we see that the first microchip was invented
in 1959. Thus the complexity was then one transistor. In 1964, complexity had
grown to 32 transistors, and in 1965, a chip in the Fairchild R&D lab had 64
transistors. Moore projected that chip complexity was doubling every year,
based on the data for 1959, 1964, and 1965. By 1975, the complexity had
increased by a factor of 1,000; from Table 1.1, we see that Moore’s Law was
right on track. In 1975, Moore predicted that the complexity would continue to
increase at a slightly slower rate by doubling every two years. (Some people
say that Moore’s Law complexity predicts a doubling every 18 months.)

TABLE 1.1 Complexity of Microchips and Moore’s Law

Year    Microchip Complexity:    Moore’s Law Complexity:
        Transistors              Transistors
1959    1                        2^0 = 1
1964    32                       2^5 = 32
1965    64                       2^6 = 64
1975    64,000                   2^16 = 65,536
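
The Moore’s Law column of Table 1.1 is simply a power of two in the number of years since 1959. As a small sketch (the function name and its default arguments are chosen here for illustration), the same projection can be computed for any doubling period:

```python
def moores_law(year, base_year=1959, base_count=1, doubling_years=1.0):
    """Projected transistor count, doubling every `doubling_years` years."""
    return base_count * 2 ** ((year - base_year) / doubling_years)

if __name__ == "__main__":
    for year in (1959, 1964, 1965, 1975):
        print(year, int(moores_law(year)))
```

Changing doubling_years to 2.0 or 1.5 shows how strongly the projection depends on the assumed doubling period: over the same 16 years, a two-year period gives 2^8 = 256 rather than 2^16 = 65,536.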

In Table 1.2, the transistor complexity of Intel’s CPUs is compared with
Moore’s Law, with a doubling every two years. Note that there are many
closely spaced releases with different processor speeds; however, the table
records the first release of the architecture, generally at the initial speed.
The Pentium P5 is generally called Pentium I, and the Pentium II is a P6
with MMX technology. In 1993, with the introduction of the Pentium, the
Intel microprocessor complexities fell slightly behind Moore’s Law. Some
say that Moore’s Law no longer holds because transistor spacing cannot be
reduced rapidly with present technologies [Mann, 2000; Markoff, 1999]; however,
Moore, now Chairman Emeritus of Intel Corporation, sees no fundamental
barriers to increased growth until 2012 and also sees that the physical
limitations on fabrication technology will not be reached until 2017 [Moore,
2000].

TABLE 1.2 Transistor Complexity of Microprocessors and Moore’s Law
Assuming a Doubling Period of Two Years

Year      CPU                     Microchip       Moore’s Law Complexity:
                                  Complexity:     Transistors
                                  Transistors
1971.50   4004                    2,300           (2^0) x 2,300 = 2,300
1978.75   8086                    31,000          (2^(7.25/2)) x 2,300 = 28,377
1982.75   80286                   110,000         (2^(4/2)) x 28,377 = 113,507
1985.25   80386                   280,000         (2^(2.5/2)) x 113,507 = 269,967
1989.75   80486                   1,200,000       (2^(4.5/2)) x 269,967 = 1,284,185
1993.25   Pentium (P5)            3,100,000       (2^(3.5/2)) x 1,284,185 = 4,319,466
1995.25   Pentium Pro (P6)        5,500,000       (2^(2/2)) x 4,319,466 = 8,638,933
1997.50   Pentium II (P6 + MMX)   7,500,000       (2^(2.25/2)) x 8,638,933 = 18,841,647
1998.50   Merced (P7)             14,000,000      (2^(3.25/2)) x 8,638,933 = 26,646,112
1999.75   Pentium III             28,000,000      (2^(1.25/2)) x 26,646,112 = 41,093,922
2000.75   Pentium 4               42,000,000      (2^(1/2)) x 41,093,922 = 58,115,582

Note: This table is based on Intel's data from its Microprocessor Report:
http://www.physics.udel.edu/wwwusers.watson.scenl03/intel.html.
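
The Moore’s Law column of Table 1.2 can be reproduced with a few lines of code. The sketch below takes the year and transistor counts from Table 1.2 and projects directly from the 4004 baseline with a two-year doubling period; because the table chains its multiplications from row to row, the directly computed values can differ from the printed ones in the last few digits. The names used are illustrative.

```python
# (year, CPU, transistor count) as listed in Table 1.2
INTEL_CPUS = [
    (1971.50, "4004", 2_300),
    (1978.75, "8086", 31_000),
    (1982.75, "80286", 110_000),
    (1985.25, "80386", 280_000),
    (1989.75, "80486", 1_200_000),
    (1993.25, "Pentium (P5)", 3_100_000),
    (1995.25, "Pentium Pro (P6)", 5_500_000),
    (1997.50, "Pentium II (P6 + MMX)", 7_500_000),
    (1998.50, "Merced (P7)", 14_000_000),
    (1999.75, "Pentium III", 28_000_000),
    (2000.75, "Pentium 4", 42_000_000),
]

BASE_YEAR, BASE_COUNT = 1971.50, 2_300  # the 4004 is the starting point

def moore_prediction(year, doubling_years=2.0):
    """Transistor count predicted by doubling every `doubling_years` years."""
    return BASE_COUNT * 2 ** ((year - BASE_YEAR) / doubling_years)

if __name__ == "__main__":
    for year, cpu, actual in INTEL_CPUS:
        predicted = moore_prediction(year)
        ratio = actual / predicted  # below 1.0 means behind Moore's Law
        print(f"{year:8.2f}  {cpu:<22} {actual:>11,} {predicted:>14,.0f} {ratio:6.2f}")
```

The ratio column makes the comparison in the text easy to read off: values near 1.0 through the 80486, then values below 1.0 from the Pentium onward.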

The data in Table 1.2 is plotted in Fig. 1.1 and shows a close fit to Moore’s
Law. The three data points between 1997 and 2000 seem to be below the curve;
however, the Pentium 4 data point is back on the Moore’s Law line. Moore’s
Law fits the data so well in the first 15 years (Table 1.1) that Moore has occupied
a position of authority and respect at Fairchild and, later, Intel. Thus,
there is some possibility that Moore’s Law is a self-fulfilling prophecy: that
is, the engineers at Intel plan their new projects to conform to Moore’s Law.
The problems presented at the end of this chapter explore how Moore’s Law
is faring in the 21st century.

An article by Professor Seth Lloyd of MIT in the September 2000 issue
of Nature explores the fundamental limitations of Moore’s Law for a laptop
based on the following: Einstein’s Special Theory of Relativity (E = mc^2),
Heisenberg’s Uncertainty Principle, maximum entropy, and the Schwarzschild
Radius for a black hole. For a laptop with one kilogram of mass and one liter
of volume, the maximum available energy is 25 million megawatt hours (the
energy produced by all the world’s nuclear power plants in 72 hours); the ultimate
speed is 5.4 x 10^50 hertz (about 10^43 times the speed of the Pentium 4); and
the memory size would be 2.1 x 10^31 bits, which is 4 x 10^30 bytes (1.6 x
10^22 times that for a 256 megabyte memory) [Johnson, 2000]. Clearly, fabrication
techniques will limit the complexity increases before these fundamental
limitations are reached.
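
The energy figure quoted above follows directly from E = mc^2 for one kilogram of mass. The short check below is only a unit conversion; the speed of light and the joules-per-megawatt-hour factor are standard constants, and the function name is illustrative.

```python
SPEED_OF_LIGHT = 299_792_458.0  # meters per second
JOULES_PER_MWH = 3.6e9          # 1 megawatt hour = 10^6 W x 3,600 s

def rest_energy_mwh(mass_kg):
    """Energy equivalent of a mass, E = m c^2, expressed in megawatt hours."""
    return mass_kg * SPEED_OF_LIGHT ** 2 / JOULES_PER_MWH

if __name__ == "__main__":
    print(f"{rest_energy_mwh(1.0):,.0f} MWh")
```

The result is roughly 2.5 x 10^7 megawatt hours, in agreement with the 25 million megawatt hours cited for a one-kilogram laptop.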


Figure 1.1 Comparison of Moore’s Law with Intel data.


1.2.3 Memory Growth

Memory size has also increased rapidly since 1965, when the PDP-8 minicomputer
came with 4 kilobytes of core memory and when an 8 kilobyte system
was considered large. In 1981, the IBM personal computer was limited
to 640 kilobytes of memory by the operating system’s nearsighted specifications,
even though many “workaround” solutions were common. By the
early 1990s, 4 or 8 megabyte memories for PCs were the rule, and in 2000,
the standard PC memory size had grown to 64-128 megabytes. Disk memory
has also increased rapidly: from small 32-128 kilobyte disks for the PDP-8e
computer in 1970 to a 10 megabyte disk for the IBM XT personal computer
in 1982. From 1991 to 1997, disk storage capacity increased by about 60%
per year, yielding an eighteenfold increase in capacity [Fisher, 1997; Markoff,
1999]. In 2001, the standard desk PC came with a 40 gigabyte hard drive.
If Moore’s Law predicts a doubling of microprocessor complexity every two
years, disk storage capacity has increased by 2.56 times each two years, faster
than Moore’s Law.
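
The 2.56 figure is compound growth: a 60% annual increase applied for two years. The sketch below (function name chosen for illustration) computes the growth factor for any rate and number of years.

```python
def growth_factor(annual_rate, years):
    """Total multiplication in capacity after compounding for `years` years."""
    return (1.0 + annual_rate) ** years

if __name__ == "__main__":
    print(round(growth_factor(0.60, 2), 2))  # two-year factor at 60% per year
    print(round(growth_factor(0.60, 6), 1))  # six-year factor, 1991 to 1997
    print(round(growth_factor(1.00, 2), 2))  # doubling every year, for contrast
```

At exactly 60% per year the six-year factor comes out near 17, so the eighteenfold figure quoted above presumably reflects rounding or an annual rate slightly above 60%. A doubling every two years corresponds to an annual rate of about 41%, which is why 60% per year outpaces Moore’s Law.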


1.2.4 Digital Electronics in Unexpected Places

The examples of the need for fault tolerance discussed previously focused on
military, space, and other large projects. There is no less a need for fault tolerance
in the home now that electronics and most electrical devices are digital,
which has greatly increased their complexity. In the 1940s and 1950s, the most
complex devices in the home were the superheterodyne radio receiver with 5
vacuum tubes, and early black-and-white television receivers with 35 vacuum
tubes. Today, the microprocessor is ubiquitous, and, since a large percentage of
modern households have a home computer, this is only the tip of the iceberg.
In 1997, the sale of embedded microcomponents (simpler devices than those
used in computers) totaled 4.6 billion, compared with about 100 million microprocessors
used in computers. Thus computer microprocessors only represent
2% of the market [Hafner, 1999; Pollack, 1999].

The bewildering array of home products with microprocessors includes
the following: clothes washers and dryers; toasters and microwave ovens;
electronic organizers; digital televisions and digital audio recorders; home
alarm systems and elderly medic alert systems; irrigation systems; pacemakers;
video games; Web-surfing devices; copying machines; calculators; toothbrushes;
musical greeting cards; pet identification tags; and toys. Of course
this list does not even include the cellular phone, which may soon assume
the functions of both a personal digital assistant and a portable Internet interface.
It has been estimated that the typical American home in 1999 had 40-60
microprocessors, a number that could grow to 280 by 2004. In addition, a
modern family sedan contains about 20 microprocessors, while a luxury car
may have 40-60 microprocessors, which in some designs are connected via a
local area network [Stepler, 1998; Hafner, 1999].

Not all of these devices are simple, either. An electronic toothbrush has 
3,000 lines of code. The Furby, a $30 electronic-robotic pet, has 2 main 
processors, 21,600 lines of code, an infrared transmitter and receiver for 
Furby-to-Furby communication, a sound sensor, a tilt sensor, and touch sensors 
on the front, back, and tongue. When Furbys were in short supply before 
Christmas 1998, Web site prices rose as high as $147.95 plus shipping 
[USA Today, 1998]. In 2000, the sensation was Billy Bass, a fish mounted on a 
wall plaque that wiggled, talked, and sang when you walked by, triggering an 
infrared sensor. 

Hackers have even taken an interest in Furby and Billy Bass. They have 
modified the hardware and software controlling the interface so that one Furby 
controls others. They have modified Billy Bass to speak the hackers’ dialog 
and sing their songs. 

Late in 2000, Sony introduced a second-generation dog-like robot called 
Aibo (Japanese for “pal”), with 20 motors, a 32-bit RISC processor, 32 
megabytes of memory, and an artificial intelligence program. Aibo acts like 
a frisky puppy. It has color-camera eyes and stereo-microphone ears, touch 
sensors, a sound-synthesis voice, and gyroscopes for balance. Four different 
“personality” modules make this $1,500 robot more than a toy [Pogue, 2001]. 
What is the need for fault tolerance in such devices? If a Furby fails, you 
discard it, but it would be disappointing if that were the only sensible choice 
for a microwave oven or a washing machine. It seems that many such devices 
are designed without thought of recovery or fault tolerance. Lawn irrigation 
timers, VCRs, microwave ovens, and digital phone answering machines are all 
upset by power outages, and only the best designs have effective battery 
backups. My digital answering machine was designed with an effective recovery 
mode. The battery backup works well, but the machine “locks up” and will not 
function about once a year. To recover, the battery and AC power are 
disconnected for about 5 minutes; when the power is restored, a 1.5-minute 
countdown begins, during which the device reinitializes. There are many stories 
in which failure of an ignition control computer stranded an auto in a remote 
location at night. Couldn’t engineers develop a recovery mode to limp home, 
even if it did use a little more gas or emit fumes on the way home? Sufficient 
fault-tolerant technology exists; however, designers have to use it. 
Fortunately, the cellular phone allows one to call for help! 

Although the preceding examples relate to electronic systems, there is no 
less a need for fault tolerance in mechanical, pneumatic, hydraulic, and other 
systems. In fact, almost all of us need a fault-tolerant emergency procedure to 
heat our homes in case of prolonged power outages. 


1.3 RELIABILITY AND AVAILABILITY 
1.3.1 Reliability Is Often an Afterthought 

High reliability and availability are very difficult to achieve in very 
complex systems. Thus, a system designer should formulate a number of 
different approaches to a problem and weigh the pluses and minuses of each 
design before recommending an approach. One should be careful to base 
conclusions on an analysis of facts, not on conjecture. Sometimes the best 
solution includes simplifying the design a bit by leaving out some marginal, 
complex features. It may be difficult to convince the authors of the 
requirements that “less is more,” but this is sometimes the best approach. 
Design decisions often change as new technology is introduced. At one time any 
attempt to digitize the Library of Congress would have been judged infeasible 
because of the storage requirement. However, by using modern technology, this 
could be accomplished with two modern RAID disk storage systems such as the EMC 
Symmetrix systems, which store more than nine terabytes ($9 \times 10^{12}$ 
bytes) [EMC Products-At-A-Glance, www.emc.com]. The computation is outlined in 
the problems at the end of this chapter. 

Reliability and availability of the system should always be two factors that 
are included, along with cost, performance, time of development, risk of 
failure, and other factors. Sometimes it will be necessary to discard a few 
design objectives to achieve a good design. The system engineer should always 
keep in mind that the design objectives generally contain a list of key 
features and a list of desirable features. The design must satisfy the key 
features, but if one or two of the desirable features must be eliminated to 
achieve a superior design, the trade-off is generally a good one. 

1.3.2 Concepts of Reliability 

Formal definitions of reliability and availability appear in Appendices A and 
B; however, the basic ideas are easy to convey without a mathematical 
development, which will occur later. Both of these measures apply to how good 
the system is and how frequently it goes down. An easy way to introduce 
reliability is in terms of test data. If 50 systems operate for 1,000 hours on 
test and two fail, then we would say the probability of failure, $P_f$, for 
this system in 1,000 hours of operation is 2/50 or $P_f(1{,}000) = 0.04$. 
Clearly the probability of success, $P_s$, which is known as the reliability, 
$R$, is given by $R(1{,}000) = P_s(1{,}000) = 1 - P_f(1{,}000) = 48/50 = 0.96$. 
Thus, reliability is the probability of no failure within a given operating 
period. One can also deal with a failure rate, fr, for the same system that, in 
the simplest case, would be fr = 2 failures/(50 x 1,000) operating hours; that 
is, fr = $4 \times 10^{-5}$ or, as it is sometimes stated, fr = $z$ = 40 
failures per million operating hours, where $z$ is often called the hazard 
function. The units used in the telecommunications industry are fits (failures 
in time), which are failures per billion operating hours. More detailed 
mathematical development relates the reliability, the failure rate, and time. 
For the simplest case where the failure rate $z$ is a constant (one generally 
uses $\lambda$ to represent a constant failure rate), the reliability function 
can be shown to be $R(t) = e^{-\lambda t}$. If we substitute the preceding 
values, we obtain 

$$R(1{,}000) = e^{-4 \times 10^{-5} \times 1{,}000} = 0.96$$

which agrees with the previous computation. 
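
The arithmetic above can be checked with a few lines of Python. This is only a minimal sketch: the function names are illustrative and not part of the text, and the constant-failure-rate model $R(t) = e^{-\lambda t}$ is the one assumed in the paragraph above.

```python
import math

def reliability_from_test(units_tested, units_failed):
    """Empirical reliability: fraction of units that survived the test."""
    return 1.0 - units_failed / units_tested

def failure_rate(units_failed, units_tested, hours_each):
    """Simplest-case failure rate: failures per unit operating hour."""
    return units_failed / (units_tested * hours_each)

def reliability_exponential(lam, t):
    """Reliability for a constant failure rate lam over t hours."""
    return math.exp(-lam * t)

r_empirical = reliability_from_test(50, 2)        # 48/50
lam = failure_rate(2, 50, 1000)                   # failures per hour
r_model = reliability_exponential(lam, 1000)      # e^(-0.04)
fits = lam * 1e9                                  # failures per billion hours

print(f"empirical R(1,000) = {r_empirical:.2f}")
print(f"failure rate       = {lam:.1e} per hour ({fits:.0f} fits)")
print(f"model R(1,000)     = {r_model:.4f}")
```

The empirical value (0.96) and the exponential model (about 0.9608) agree to two decimal places, which is the agreement noted in the text.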

It is now easy to show that complexity causes serious reliability problems. 
The simplest system reliability model is to assume that in a system with $n$ 
components, all the components must work. If the component reliability is 
$R_c$, then the system reliability, $R_{sys}$, is given by 

$$R_{sys}(t) = [R_c(t)]^n = [e^{-\lambda t}]^n = e^{-n \lambda t}$$

Consider the case of the first supercomputer, the CDC 6600 [Thornton, 
1970]. This computer had 400,000 transistors, for which the estimated failure 
rate was then $4 \times 10^{-9}$ failures per hour. Thus, even though the 
failure rate of each transistor was very small, the computer reliability for 
1,000 hours would be 

$$R(1{,}000) = e^{-400{,}000 \times 4 \times 10^{-9} \times 1{,}000} = 0.20$$
If we repeat the calculation for 100 hours, the reliability becomes 0.85. 
Remember that these calculations do not include the other components in the 
computer that can also fail. The conclusion is that the failure rate of devices 
with so many components must be very low to achieve reasonable reliabilities. 
Integrated circuits (ICs) improve reliability because each IC replaces hundreds 
of thousands or millions of transistors and also because the failure rate of an 
IC is low. See the problems at the end of this chapter for more examples. 
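
As a rough check of the CDC 6600 figures, the series model $e^{-n \lambda t}$ can be evaluated directly. The sketch below assumes identical, independent components with a constant failure rate, as in the text; the function name is illustrative.

```python
import math

def series_reliability(n_components, lam_per_hour, hours):
    """All n components must work; each has constant failure rate lam."""
    return math.exp(-n_components * lam_per_hour * hours)

N_TRANSISTORS = 400_000
LAMBDA = 4e-9  # failures per hour per transistor (estimate quoted in the text)

r_1000 = series_reliability(N_TRANSISTORS, LAMBDA, 1000)  # e^(-1.6)
r_100 = series_reliability(N_TRANSISTORS, LAMBDA, 100)    # e^(-0.16)

print(f"R(1,000 h) = {r_1000:.2f}")
print(f"R(100 h)   = {r_100:.2f}")
```

The exponent $n \lambda t$ is what matters: at 1,000 hours it is 1.6, so even a per-transistor rate of a few failures per billion hours leaves only about a one-in-five chance of failure-free operation.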

1.3.3 Elementary Fault-Tolerant Calculations 

The simplest approach to fault tolerance is classical redundancy, that is, to have 
an additional element to use if the operating one fails. As a simple example, let 
us consider a home computer in which constant usage requires it to be always 
available. A desktop will be the primary computer; a laptop will be the backup. 
The first step in the computation is to determine the failure rate of a personal 
computer, which will be computed from the author’s own experience. Table 1.3 
lists the various computers that the author has used in the home. There has been 
a total of 2 failures and 29 years of usage. Since each year contains 8,766 hours, 
we can easily convert this into a failure rate. The question becomes whether to 
estimate the number of hours of usage per year or simply to consider each year 
as a year of average use. We choose the latter for simplicity. Thus the failure 
rate becomes 2/29 = 0.069 failures per year, and the reliability of a single PC 
for one year becomes $R(1) = e^{-0.069} = 0.933$. This means there is about a 
6.7% probability of failure each year based on this data. 

If we have two computers, both must fail for us to be without a computer. 
Assuming the failures of the two computers are independent, as is generally 
the case, then the system failure probability is the product of the failure 
probabilities for computer 1 (the primary) and computer 2 (the backup). Using 
the preceding failure data, the probability of one computer failing within a 
year is 0.067; of both failing, $0.067 \times 0.067 = 0.00449$. Thus, the 
probability of having at least one computer for use is 0.9955, and the 
probability of having no computer at some time during the year is reduced from 
6.7% to 0.45%, a decrease by a factor of 15. The probability of having no 
computer will really be much less since the failed computer will be rapidly 
repaired. 


TABLE 1.3 Home Computers Owned by the Author 

Computer                          Date of Ownership            Failures     Operating Years
--------------------------------  ---------------------------  -----------  ---------------
IBM XT Computer: Intel 8088       1983-90                      0 failures   7 years
  and 10 MB disk
Home upgrade of XT to Intel 386   1990-95                      0 failures   5 years
  Processor and 65 MB disk
IBM XT Components                 Repackaged plus added new    1 failure    2 years
  (repackaged in 1990)              components used: 1990-92
Digital Equipment Laptop          1992-99                      0 failures   7 years
  386 and 80 MB disk
IBM Compatible 586                1995-2001                    1 failure    6 years
IBM Notebook 240                  1999-2001                    0 failures   2 years


Figure 1.2 Examples of simple computer networks: (a) a tree network connecting 
the four cities; (b) a Hamiltonian network connecting the four cities. 
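
The two-computer calculation can be reproduced from the raw data in Table 1.3. The following is a minimal sketch under the same assumptions as the text (constant failure rate, independent failures, no repair); the variable names are illustrative.

```python
import math

FAILURES = 2
OPERATING_YEARS = 29           # sum of the Operating Years column in Table 1.3

lam = FAILURES / OPERATING_YEARS          # failures per year
r_single = math.exp(-lam * 1.0)           # one PC survives one year
q_single = 1.0 - r_single                 # one PC fails within the year
q_pair = q_single ** 2                    # both PCs fail (independence assumed)
improvement = q_single / q_pair           # equals 1 / q_single

print(f"failure rate         = {lam:.3f} per year")
print(f"single PC R(1)       = {r_single:.3f}")
print(f"P(no computer), one  = {q_single:.4f}")
print(f"P(no computer), two  = {q_pair:.5f}")
print(f"improvement factor   = {improvement:.1f}")
```

Carrying full precision gives a two-failure probability of about 0.0044 rather than the 0.00449 obtained from the rounded value 0.067; the factor-of-15 improvement is unchanged.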

As another example of reliability computations, consider the primitive 
computer network as shown in Fig. 1.2(a). This is called a tree topology because 
all the nodes are connected and there are no loops. Assume that $p$ is the 
reliability for some time period for each link between the nodes. The probability 
that Boston and New York are connected is the probability that one link is 
good, that is, $p$. The same probability holds for New York-Philadelphia and for 
Philadelphia-Pittsburgh, but the Boston-Philadelphia connection requires two 
links to work, the probability of which is $p^2$. More commonly we speak of the 
all-terminal reliability, which is the probability that all cities are connected 
($p^3$ in this example) because all three links must be working. Thus if 
$p = 0.9$, the all-terminal reliability is 0.729. 

The reliability of a network is raised if we add more links so that loops 
are created. The Hamiltonian network shown in Fig. 1.2(b) has one more link 
than the tree and has a higher reliability. In the Hamiltonian network, all nodes 
are connected if all four links are working, which has a probability of $p^4$. All 
nodes are still connected if there is a single link failure, which has a probability 
of three successes and one failure given by $p^3(1 - p)$. However, there are 4 
ways for one link to fail, so the probability of one link failing is $4p^3(1 - p)$. 
The reliability is the probability that there are zero failures plus the probability 
that there is one failure, which is given by $[p^4 + 4p^3(1 - p)]$. Assuming that 
$p = 0.9$ as before, the reliability becomes 0.9477, a considerable improvement 
over the tree network. Some of the basic principles for designing and analyzing the 
reliability of computer networks are discussed in this book. 
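
For networks this small, the all-terminal reliability can also be found by brute force: enumerate every combination of working and failed links and add up the probabilities of the combinations that leave all cities connected. The sketch below does this for both networks. The fourth link of the Hamiltonian network is assumed here to join Pittsburgh and Boston, since the figure itself is not reproduced; any single closing link that forms a four-city ring gives the same result.

```python
from itertools import product

def is_connected(nodes, links):
    """Depth-first search over the surviving links."""
    adjacency = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    seen, stack = {nodes[0]}, [nodes[0]]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(nodes)

def all_terminal_reliability(nodes, links, p):
    """Sum the probability of every link state that keeps all nodes connected."""
    total = 0.0
    for state in product([True, False], repeat=len(links)):
        prob = 1.0
        for up in state:
            prob *= p if up else (1.0 - p)
        surviving = [link for link, up in zip(links, state) if up]
        if is_connected(nodes, surviving):
            total += prob
    return total

CITIES = ["Boston", "New York", "Philadelphia", "Pittsburgh"]
TREE = [("Boston", "New York"), ("New York", "Philadelphia"),
        ("Philadelphia", "Pittsburgh")]
RING = TREE + [("Pittsburgh", "Boston")]   # assumed closing link

r_tree = all_terminal_reliability(CITIES, TREE, 0.9)
r_ring = all_terminal_reliability(CITIES, RING, 0.9)
print(f"tree: {r_tree:.4f}  ring: {r_ring:.4f}")
```

Enumeration grows as $2^m$ in the number of links $m$, so it is only practical for small examples like this one.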

1.3.4 The Meaning of Availability 

Reliability is the probability of no failures in an interval, whereas 
availability is the probability that an item is up at any point in time. Both 
reliability and availability are used extensively in this book as measures of 
performance and “yardsticks” for quantitatively comparing the effectiveness of 
various fault-tolerant methods. Availability is a good metric to measure the 
beneficial effects of repair on a system. Suppose that an air traffic control 
system fails on average once a year; we then would say that the mean time to 
failure (MTTF) was 8,766 hours (the number of hours in a year). If an airline’s 
reservation system went down 5 times in a year, we would say that the MTTF was 
1/5 of that of the air traffic control system, or 1,753 hours. One would say 
that, based on the MTTF, the air traffic control system was much better; 
however, suppose we consider repair and calculate typical availabilities. A 
simple formula for calculating the system availability (actually, the 
steady-state availability), based on the Uptime and Downtime of the system, is 
given as follows: 

$$A = \frac{\text{Uptime}}{\text{Uptime} + \text{Downtime}}$$

If the air traffic control system goes down for about 1 hour whenever it fails, 
the availability would be calculated by substitution into the preceding formula 
yielding $A = 8{,}765/(8{,}765 + 1) = 0.999886$. In the case of the airline 
reservation system, let us assume that the outages are short, averaging 1 
minute each. Thus the cumulative downtime per year is five minutes = 0.083333 
hours, and the availability would be $A = 8{,}765.916666/8{,}766 = 0.9999905$. 
Comparing the unavailabilities ($U = 1 - A$), we see 
$(1 - 0.999886)/(1 - 0.9999905) = 12$. Thus, we can say that based on 
availability the reservation system is 12 times better than the air traffic 
control system. Clearly one must use both reliability and availability to 
compare such systems. 
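
The comparison of the two systems is easy to reproduce. The short sketch below uses the uptime and downtime figures from the text; the function and variable names are illustrative.

```python
HOURS_PER_YEAR = 8766

def availability(uptime_hours, downtime_hours):
    """Steady-state availability: Uptime / (Uptime + Downtime)."""
    return uptime_hours / (uptime_hours + downtime_hours)

# Air traffic control: one failure per year, about 1 hour of downtime.
a_atc = availability(HOURS_PER_YEAR - 1, 1)

# Airline reservations: five failures per year, about 1 minute each.
reservation_downtime = 5 * (1 / 60)
a_res = availability(HOURS_PER_YEAR - reservation_downtime, reservation_downtime)

unavailability_ratio = (1 - a_atc) / (1 - a_res)

print(f"A (air traffic control) = {a_atc:.6f}")
print(f"A (reservations)        = {a_res:.7f}")
print(f"unavailability ratio    = {unavailability_ratio:.1f}")
```

The ratio of 12 is simply 60 minutes of annual downtime against 5 minutes; MTTF alone would have ranked the two systems the other way round.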

A mathematical technique called Markov modeling will be used in this book 
to compute the availability for various systems. Rapid repair of failures in 
redundant systems greatly increases both the reliability and availability of such 
systems. 

1.3.5 Need for High Reliability and Safety in Fault-Tolerant Systems 

Fault-tolerant systems are generally required in applications involving a high 
level of safety, since a failure can injure or kill many people. A number of 
specifications, field failure data, and calculations are listed in Table 1.4 to 
give the reader some appreciation of the ranges of reliability and availability 
required and realized for various fault-tolerant systems. 

A pattern emerges after some study of Table 1.4. The availability of several 
of the highly reliable fault-tolerant systems is similar. The availability 
requirement for the ESS telephone switching system (0.9999943), which is spoken 
of as “5 nines 43” in shorthand fashion, is seen to be equaled or bettered by 
actual performance of “5 nines 05” for (3B, 1A) and “5 nines 62” for (3A). Often 
one will compare system availability by quoting the downtime: for example, 
5.7 hours per million for ESS requirements, 0.5 hours per million for (3B, 
1A), and 3.8 hours per million for (3A). The Tandem goal was “5 nines 60” 
and the Stratus quote was “5 nines 05.” Lastly, a standby system (if one could 
construct a fault-tolerant standby architecture) using 1985 technology would 
yield an availability of “5 nines 11.” It is interesting to speculate whether this 
represents some level of performance one is able to achieve under certain 
limitations or whether the only proven numbers (the ESS switching systems) have 
become the goal others are quoting. The reader should remember that neither 
Tandem nor Stratus provides data on their field-demonstrated availability. 

In the aircraft field there are some established system safety standards for 
the probability of catastrophe. These are extracted in Table 1.5, which also 
shows data on avionics-software-problem occurrence rates. 

The two standards plus the software data quoted in Table 1.5 provide a 
rough but “overlapping” hierarchy of values. Some researchers have been 
pessimistic about the possibility of proving before use the reliability of 
hardware or software with failure probabilities of $< 10^{-9}$. To demonstrate 
such a probability, we would need to test 10,000 systems for 10 years (about 
100,000 hours) with 1 or 0 failures. Clearly this is not feasible, and one must 
rely on modeling and test data accumulated for major systems. However, from 
Shooman [1996], we can estimate that the U.S. air fleet of larger passenger 
aircraft flew about 12,000,000 flight hours in 1994 and today must fly about 
20,000,000 hours. Thus if it were commercially feasible to install a new piece 
of equipment in every aircraft for one year and test it, but not have it 
connected to aircraft systems, one could generate 20,000,000 test hours. If no 
failures are observed, the statistical rule is to use 1/3 as the equivalent 
number of failures (see Section B3.5), and one could demonstrate a failure 
rate as low as $(1/3)/20{,}000{,}000 = 1.7 \times 10^{-8}$. It seems clear 
that the $10^{-9}$ probabilities given in Table 1.5 are the reasons why 
$10^{-9}$ was chosen for the goals of SIFT and FTMP in Table 1.4. 


TABLE 1.5 Aircraft Safety Standards and Data 

System Criticality               Likelihood             Probability of Failure/Flight Hr
-------------------------------  ---------------------  --------------------------------
Nonessential (a)                 Probable               $> 10^{-5}$
Essential (a)                    Improbable             $10^{-5}$ to $10^{-9}$
Flight control (b) (e.g.,        Extremely remote       $5 \times 10^{-7}$
  bombers, transports, cargo,
  and tanker)
Critical (a)                     Extremely improbable   $< 10^{-9}$
Avionics software failure rates                         Average failure rate of
                                                          $1.5 \times 10^{-7}$ failures/hr
                                                          for 6 major avionics systems

(a) FAA, AC 25.1309-1A. 
(b) MIL-F-9490. 

Source: [Shooman, 1996]. 
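
The demonstration arithmetic can be sketched as follows. The only rule used is the one quoted above (count zero observed failures as 1/3 of a failure); the function name is illustrative.

```python
def demonstrated_failure_rate(test_hours, failures_observed):
    """Failure rate supported by a test, using the 1/3-failure rule
    quoted in the text when no failures are observed."""
    equivalent_failures = failures_observed if failures_observed > 0 else 1 / 3
    return equivalent_failures / test_hours

# Laboratory route: 10,000 systems for about 100,000 hours each, 1 failure.
lab_hours = 10_000 * 100_000
lab_rate = demonstrated_failure_rate(lab_hours, 1)

# Fleet route: about 20,000,000 flight hours in one year, no failures.
fleet_rate = demonstrated_failure_rate(20_000_000, 0)

print(f"lab test   : {lab_rate:.1e} failures/hour")
print(f"fleet test : {fleet_rate:.1e} failures/hour")
```

Even a full year of fleet-wide testing with no failures supports a rate of only about $1.7 \times 10^{-8}$ per hour, more than an order of magnitude short of the $10^{-9}$ goal.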

1.4 ORGANIZATION OF THIS BOOK 
1.4.1 Introduction 

This book was written for a diverse audience, including system designers in 
industry and students from a variety of backgrounds. Appendices A and B, 
which discuss probability and reliability principles, are included for those 
readers who need to deepen or refresh their knowledge of these topics. Similarly, 
because some readers may need some background in digital electronics, 
Appendix C discusses digital electronics and architecture and provides a 
systems-level summary of these topics. The emphasis of this book is on 
analysis of systems and optimum design approaches. For large industrial problems, 
this emphasis will serve as a prelude to complement and check more 
comprehensive and harder-to-interpret computer analysis. Often the designer has 
to make a trade-off among several proposed designs. Many of the examples 
and some of the theory in this text address such trade-offs. The theme of the 
analysis and the trade-offs helps to unite the different subjects discussed in 
the various chapters. In many ways, each chapter is self-contained when it is 
accompanied by supporting appendix material; hence a practitioner can read 
sections of the book pertinent to his or her work, or an instructor can choose a 
selected group of chapters for a classroom presentation. This first chapter has 
described the complex nature of modern system design, which is one of the 
primary reasons that fault tolerance is needed in most systems. 

1.4.2 Coding Techniques 

A standard technique for guarding the veracity of a digital message/signal is 
to transmit the message more than once or to attach additional check bits to 
the message to detect and sometimes correct errors caused by “noise” that 
has corrupted some bits. Such techniques, called error-detecting and 
error-correcting codes, are introduced in Chapter 2. These codes are used to detect 
and correct errors in communications, memory storage, and signal transmission 
within computers and circuitry. When errors are sparse, the standard parity-bit 
and Hamming codes, developed from basic principles in Chapter 2, are very 
successful. The effectiveness of such codes is compared based on the 
probabilities that the codes fail to detect multiple errors. The probability that the 
coding and decoding chips may fail catastrophically is also included in the 
analysis. Some original work is introduced to show under which circumstances the 
chip failures are significant. In some cases, errors occur in groups of adjacent 
bits, and an introductory development of burst error codes, which are used in 
such cases, is presented. An introduction to more sophisticated Reed-Solomon 
codes concludes this chapter. 
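
As a small preview of the check-bit idea, the sketch below appends a single even-parity bit to a message and shows that one corrupted bit is detected while two corrupted bits slip through. It is only an illustration with made-up data and hypothetical function names, not the development given in Chapter 2.

```python
def add_even_parity(bits):
    """Append one check bit so that the total number of 1s is even."""
    return bits + [sum(bits) % 2]

def parity_ok(word):
    """A received word passes the check if it holds an even number of 1s."""
    return sum(word) % 2 == 0

message = [1, 0, 1, 1, 0, 0, 1]          # illustrative 7-bit message
sent = add_even_parity(message)

one_error = sent.copy()
one_error[2] ^= 1                        # noise flips a single bit

two_errors = one_error.copy()
two_errors[5] ^= 1                       # noise flips a second bit

print("no errors  passes check:", parity_ok(sent))
print("one error  passes check:", parity_ok(one_error))
print("two errors pass check:  ", parity_ok(two_errors))
```

The undetected double error is the kind of failure-to-detect event whose probability is used to compare codes.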

1.4.3 Redundancy, Spares, and Repairs 

One way of improving system reliability is to reduce the failure rate of 
pivotal individual components. Sometimes this is not a feasible or cost-effective 
approach to meeting very high reliability requirements. Chapter 3 introduces 
another technique, redundancy, and considers the fundamental techniques 
of system and component redundancy. The standard approach is to have two (or 
more) units operating in parallel so that if one fails the other(s) take over. 
Parallel components are generally more efficient than parallel systems in improving 
the resulting reliability; however, some sort of “coupling device” is needed to 
parallel the units. The reliability of the coupling device is modeled, and under 
certain circumstances failures of this device may significantly degrade system 
reliability. Various approximations are developed to allow easy comparison of 
different approaches, and the system mean time to failure (MTTF) 
is also used to simplify computations. The effects of common-cause failures, 
which can negate much of the beneficial effects of redundancy, are discussed. 

The other major form of redundancy is standby redundancy, in which the 
redundant component is powered down until the on-line system fails. This is 
often superior to parallel reliability. In the standby case, the sensing system 
that detects failures and switches is more complex, and the reliability of this 
device is studied to assess the degradation in predicted reliability caused by the 
standby switch. The study of standby systems is based on Markov probability 
models that are introduced in the appendices and deliberately developed in 
Chapter 3 because they will be used throughout the book. 

Repair improves the reliability of both parallel and standby systems, and 
Markov probability models are used to study the relative benefits of repair for 
both approaches. Markov modeling generates a set of differential equations that 
require a solution to complete the analysis. The Laplace transform approach is 
introduced and used to simplify the solution of the Markov equations for both 
reliability and availability analysis. 

Several computer architectures for fault tolerance are introduced and 
discussed. Modern memory storage systems use the various RAID architectures 
based on an array of redundant disks. Several of the common RAID techniques 
are analyzed. The class of fault-tolerant computer systems called nonstop 
systems is introduced. Also introduced and analyzed are two other systems: the 
Tandem system, which depends primarily on software fault tolerance, and the 
Stratus system, which uses hardware fault tolerance. A brief description of a 
similar system approach, a Sun computer system cluster, concludes the chapter. 

1.4.4 N-Modular Redundancy 

The problem of comparing the proper functioning of parallel systems was 
discussed earlier in this chapter. One of the benefits of a digital system is that all 
outputs are strings of 1s or 0s so that the comparison of outputs is simplified. 
Chapter 4 describes an approach that is often used to compare the outputs of 
three identical digital circuits processing the same input: triple modular 
redundancy (TMR). The most common circuit output is used as the system output 
(called majority voting). In the case of TMR, we assume that if outputs 
disagree, those two that are the same will together have a much higher probability 
of succeeding rather than failing. The voting device is simple, and the resulting 
system is highly reliable. As in the case of parallel or standby redundancy, the 
voting can be done at the system or subsystem level, and both approaches are 
modeled and compared. 
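
A bitwise majority vote over three outputs can be sketched in a few lines. The reliability expression used below, $R^3 + 3R^2(1 - R)$, is the standard two-out-of-three result for independent modules and a perfect voter; it is an assumption brought in for illustration and is not derived in this chapter. The function names are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 vote: each output bit is the value held by
    at least two of the three inputs."""
    return (a & b) | (a & c) | (b & c)

def tmr_reliability(r):
    """All three modules work, or exactly one fails (perfect voter and
    independent module failures assumed)."""
    return r**3 + 3 * r**2 * (1 - r)

good = 0b1011
faulty = 0b0011                       # one module has a flipped bit
voted = majority_vote(good, good, faulty)

print(f"voted output = {voted:04b}")
print(f"single module R = 0.90 -> TMR R = {tmr_reliability(0.90):.3f}")
print(f"single module R = 0.40 -> TMR R = {tmr_reliability(0.40):.3f}")
```

Under these assumptions voting helps only when the module reliability exceeds 0.5; below that, two wrong modules outvote the good one more often than not.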

Although the voter circuit is simple, it can fail; the effect of voter 
reliability, much like coupler reliability in a parallel system, must then be included. 
The possibility of using redundant voters is introduced. Repair can be used to 
improve the reliability of a voter system, and the analysis utilizes a Markov 
model similar to that of Chapter 3. Various simplified approximations are 
introduced that can be used to analyze the reliability and availability of repairable 
systems. Also introduced are more advanced voting and consensus techniques. 
The redundant system of Chapter 3 is compared with the voting techniques of 
Chapter 4. 

1.4.5 Software Reliability and Recovery Techniques 

Programming of the computer in early digital systems was largely done in 
complex machine language or low-level assembly language. Memory was limited, 
and the program had to be small and concise. Expert programmers often used 
tricks to fit the required functions into the small memory. Software errors, 
then as now, can cause the system to malfunction. The failure mode is different 
but no less disastrous than catastrophic hardware failures. Chapter 5 relates 
these program errors to resulting system failures. 

This chapter begins by describing in some detail the way programs are 
now developed in modern higher-level languages such as FORTRAN, COBOL, 
ALGOL, C, C++, and Ada. Large memories allow more complex tasks, and 
many more programmers are involved. There are many potential sources of 
errors, such as the following: (a), complex, error-prone specifications; (b), logic 
errors in individual modules (self-contained sections of the program); and (c), 
communications among modules. Sometimes code is incorporated from 
previous projects without sufficient adaptation analysis and testing, causing subtle 
but disastrous results. A classic example of the hazards of reused code is the 
Ariane-5 rocket. The European Space Agency (ESA) reused guidance software 
from Ariane-4 in Ariane-5. On its maiden flight, June 4, 1996, Ariane-5 had to 
be destroyed 40 seconds into launch, a $500 million loss. Ariane-5 developed 
a larger horizontal velocity than Ariane-4, and a register overflowed. The 
software detected an exception, but instead of taking a recoverable action it shut 
off the processor as the specifications required. A more appropriate recovery 
action might have saved the flight. To cite the legendary Murphy’s Law, “If 
things can go wrong, they will,” and they did. Even better, we might devise a 
corollary that states “then plan for it” [Pfleeger, 1998, pp. 37-39]. 

Various mathematical models describing errors are introduced. The 
introductory model is based on a simple assumption: the failure rate (error 
discovery rate) is proportional to the number of errors remaining in the software after 
it is tested and released. Combining this software failure rate with reliability 
theory leads to a software reliability model. The constants in such models are 
evaluated from test data recorded during software development. Applying such 
models during the test phase allows one to predict the reliability of the software 
once it is released for operational use. If the predicted reliability appears 
unsatisfactory, the developer can improve testing to remove more errors, rewrite 
certain problem modules, or take other action to avoid the release of an unreliable 
product. 
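
The introductory model can be sketched numerically. Below, the failure rate is taken as $z = k \, (E_T - E_c)$, proportional to the errors remaining, and combined with the constant-rate reliability function $R(t) = e^{-zt}$ from Section 1.3.2. The symbols $k$, $E_T$, and $E_c$ and every numerical value are hypothetical placeholders chosen for illustration; in practice the constants are estimated from test data.

```python
import math

def software_failure_rate(k, total_errors, corrected_errors):
    """Failure rate proportional to the number of errors still in the code."""
    return k * (total_errors - corrected_errors)

def software_reliability(k, total_errors, corrected_errors, hours):
    """Constant-rate reliability over an operating interval after release."""
    z = software_failure_rate(k, total_errors, corrected_errors)
    return math.exp(-z * hours)

K = 1e-4          # hypothetical proportionality constant, per error per hour
E_TOTAL = 130     # hypothetical number of errors initially in the program

# Release after removing 100 errors versus after removing 120 errors.
r_early = software_reliability(K, E_TOTAL, 100, hours=100)
r_late = software_reliability(K, E_TOTAL, 120, hours=100)

print(f"R(100 h), 30 errors left: {r_early:.3f}")
print(f"R(100 h), 10 errors left: {r_late:.3f}")
```

Comparing the two cases is the kind of prediction that lets a developer decide whether further testing is needed before release.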

Software redundancy can be utilized in some cases by using independently 
developed but functionally identical software. The extent to which common 
errors in independent software reduce the reliability gains is discussed; as a 
practical example, the redundant software in the NASA Space Shuttle is 
considered. 

1.4.6 Networked Systems Reliability 

Networks are all around us. They process our telephone calls, connect us to the 
Internet, and connect private industry and government computer and 
information systems. In general, such systems have a high reliability and availability 
because there is more than one path that connects all of the terminals in the 
network. Thus a single link failure will seldom interrupt communications because 
a duplicate path will exist. Since network geometry (topology) is usually 
complex, there are many paths between terminals, and therefore computation of 
network reliability is often difficult. Computer programs are available for such 
computations, two of which are referenced in the chapter. This chapter 
systematically develops methods based on graph theory (cut-sets and tie-sets) for 
analysis of a network. Alternate methods for computation are also discussed, 
and the chapter concludes with the application of such methods to the design 
of a reliable backbone network. 

1.4.7 Reliability Optimization 

Initial design of a large, complex system focuses on several issues: (a), how to 
structure the project to perform the required functions; (b), how to meet the 
performance requirements; and (c), how to achieve the required reliability. 
Designers always focus on issues (a) and (b), but sometimes, at the peril of 
developing an unreliable system, they spend a minimum of effort on issue (c). 
Chapter 7 develops techniques for optimizing the reliability of a proposed design 
by parceling out the redundancy to various subsystems. Choice among 
optimized candidate designs should be followed by a trade-off among the feasible 
designs, weighing the various pros and cons that include reliability, weight, 
volume, and cost. In some ways, one can view this chapter as a generalization 
of Chapter 3 for larger, more complex system designs. 

One simplified method of achieving optimum reliability is to meet the overall
system reliability goal by fixing the level of redundancy for the various subsystems
according to various apportionment rules. The other end of the optimization
spectrum is to obtain an exact solution by means of exhaustively computing the
reliability for all the possible system combinations. The Dynamic Programming
method was developed as a way to eliminate many of the cases in an exhaustive
computation scheme. Chapter 7 discusses the above methods as well as an effective
approximate method, a greedy algorithm, in which the optimization is divided
into a series of steps and the best choice is made for each step.
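
As a rough illustration of the greedy idea described above, the following sketch
assumes a series system in which each subsystem is a parallel group of identical
units, and spends a cost budget one unit at a time on whichever addition gives
the largest reliability gain per unit cost. The function names, unit
reliabilities, costs, and budget are hypothetical values chosen only for
illustration; they are not taken from Chapter 7.

```python
from math import prod

def system_reliability(unit_rel, counts):
    """Series system of parallel subsystems: product of 1 - (1 - r)^n."""
    return prod(1 - (1 - r) ** n for r, n in zip(unit_rel, counts))

def greedy_allocation(unit_rel, unit_cost, budget):
    """Add one redundant unit per step, always taking the best gain per cost."""
    counts = [1] * len(unit_rel)          # one unit per subsystem is mandatory
    spent = sum(unit_cost)                # cost of the nonredundant system
    while True:
        base = system_reliability(unit_rel, counts)
        best = None
        for i, cost in enumerate(unit_cost):
            if spent + cost > budget:
                continue                  # this addition is not affordable
            trial = counts.copy()
            trial[i] += 1
            gain = (system_reliability(unit_rel, trial) - base) / cost
            if best is None or gain > best[0]:
                best = (gain, i)
        if best is None:                  # nothing affordable remains
            return counts, base, spent
        counts[best[1]] += 1
        spent += unit_cost[best[1]]

# Hypothetical three-subsystem example (illustrative numbers only)
counts, rel, spent = greedy_allocation([0.90, 0.80, 0.95], [2.0, 3.0, 1.0], budget=15.0)
print(counts, round(rel, 4), spent)
```

A greedy choice of this kind is fast but is not guaranteed to find the true
optimum, which is why it is described as an approximate method.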

The best method developed in this chapter is to establish a set of upper and 
lower bounds on the number of redundancies that can be assigned for each 
subsystem. It is shown that there is a modest number of possible cases, so 
an exhaustive search within the allowed bounds is rapid and computationally 
feasible. The bounded method displays the optimal configuration as well as 
many other close-to-optimum alternatives, and it provides the designer with a 
number of good solutions among which to choose. 

1.4.8 Appendices 

This book has been written for practitioners and students from a wide variety
of disciplines. In cases where the reader does not have a background in either
probability or digital circuitry, or needs a review of principles, these appendices
provide a self-contained development of the background material of these
subjects.

Appendix A develops probability from basic principles. It serves as a tutorial,
review, or reference for the reader.

Appendix B summarizes reliability theory and develops the relationships
among reliability theory, conventional probability density and distribution
functions, and the failure rate (hazard) function. The popular MTTF metric is
presented along with sample calculations. Availability theory and Markov models
are developed.

Appendix C presents a concise introduction to digital circuit design and elementary
computer architecture. This will serve the reader who needs a background
to understand the architecture applications presented in the text.

Appendix D discusses reliability, availability, and risk-modeling programs.
Most large systems will require such software to aid in analysis. This appendix
categorizes these programs and provides information to aid the reader in contacting
the suppliers to make an informed choice among the products offered.


PROBLEMS

1.1. Show that the combined capacity of several (two or three) modern
disk storage systems, such as the EMC Symmetrix System that stores
more than nine terabytes (9 x 10^12 bytes) [EMC Products-At-A-Glance,
www.emc.com], could contain all the 26 million texts in the Library of
Congress [Web search, Library of Congress].

(a) Assume that the average book has 400 pages. 

(b) Estimate the number of lines per page by counting lines in three 
different books. 

(c) Repeat (b) for the number of words per line. 

(d) Repeat (b) for the number of characters per word. 

(e) Use the above computations to find the number of characters in the 
26 million books. 

Assume that one character is stored in one byte and calculate the number 
of Symmetrix units needed. 

1.2. Estimate the amount of storage needed to store all the papers in a standard
four-drawer business filing cabinet.

1.3. Estimate the cost of digitizing the books in the Library of Congress.
How would you do this?

1.4. Repeat problem 1.3 for the storage of problem 1.2.

1.5. Visit the Intel Web site and check the release dates and transistor complexities
given in Table 1.2.

1.6. Repeat problem 1.5 for microprocessors from other manufacturers.

1.7. Extend Table 1.2 for newer processors from Intel and other manufacturers.

1.8. Search the Web for articles about the change of mainframes in the air
traffic control system and identify the old and new computers, the past
problems, and the expected improvements from the new computers.
Hint: look at IEEE Computer and Spectrum magazines and the New York
Times.

1.9. Do some research and try to determine if the storage density for optical
copies (one page of text per square millimeter) is feasible with today’s
optical technology. Compare this storage density with that of a modern
disk or CD-ROM.

1.10. Make a list of natural, human, and equipment failures that could bring
down a library system stored on computer disks. Explain how you could
incorporate design features that would minimize such problems.

1.11. Complex solutions are not always needed. There are many good programs
for storing cooking recipes. Many cooks use a few index cards or
a cookbook with paper slips to mark their favorite recipes. Discuss the
pros and cons of each approach. Under what circumstances would you
favor each approach?

1.12. An improved version of Basic, called GW Basic, followed the original
Micro Soft Basic. “GW” did not stand for our first president or the university
that bears his name. Try to find out what GW stands for and the
origin of the software.

1.13. Estimate the number of failures per year for a family automobile and
compute the failure rate (failures per mile). Assuming 10,000 miles
driven per year, compute the number of failures per year. Convert this
into failures per hour assuming that one drives 10,000 miles per year at
an average speed of 40 miles per hour.

1.14. Assume that an auto repair takes 8 hours, including drop-off, storage,
and pickup of the car. Using the failure rate computed in problem 1.13
and this information, compute the availability of an automobile.

1.15. Make a list of safety critical systems that would benefit from fault tolerance.
Suggest design features that would help fault tolerance.

1.16. Search the Web for examples of the systems in problem 1.15 and list
the details you can find. Comment.


1.17. Repeat problems 1.15 and 1.16 for systems in the home.

1.18. Repeat problems 1.15 and 1.16 for transportation, communication,
power, heating and cooling, and entertainment systems in everyday use.

1.19. To learn of a 180 terabyte storage project, search the EMC Web site for
the movie producer Steven Spielberg, or see the New York Times, Jan.
13, 2001, p. B11. Comment.

1.20. To learn of some of the practical problems in trying to improve an existing
fault-tolerant system, consider the U.S. air traffic control system.
Search the Web for information on the current delays, the effects of
deregulation, and former President Ronald Reagan’s dismissal of striking
air traffic controllers; also see Zuckerman [2000]. A large upgrade
to the system failed and incremental upgrades are being planned instead.
Search the Web and see [Wald, 1996] for a discussion of why the
upgrade failed.

(a) Write a report analyzing what you learned.

(b) What is the present status of the system and any upgrades?

1.21. Devise a scheme for emergency home heating in case of a prolonged
power outage for a gas-fired, hot-water heating system. Consider the following:
(a) fireplace; (b) gas stove; (c) emergency generator; and (d)
other. How would you make your home heating system fault tolerant?

1.22. How would problem 1.21 change for the following:

(a) An oil-fired, hot-water heating system? 

(b) A gas-fired, hot-air heating system? 

(c) A gas-fired, hot-water heating system? 

1.23. Present two designs for a fault-tolerant voting scheme. 

1.24. Investigate the speed of microprocessors and how rapidly it has
increased over the years. You may wish to use the microprocessors in
Table 1.2 or others as data points. A point on the curve is the 1.7 gigahertz
Pentium 4 microprocessor [New York Times, April 23, 2001, p.
C1]. Plot the data in a format similar to Fig. 1.1. Does a law hold for
speed?

1.25. Some of the advances in mechanical and electronic computers occurred
during World War II in conjunction with message encoding and decoding
and cryptanalysis (code breaking). Some of the details were, and still are,
classified as secret. Find out as much as you can about these machines
and compare them with those reported on in Section 1.2.1. Hint: Look
in Randall [1975, pp. 327, 328] and Clark [1977, pp. 134, 135, 140,
151, 195, 196]. Also, search the Web for key words: Sigaba, Enigma,
T. H. Flowers, William F. Friedman, Alan Turing, and any patents by
Friedman.


2.1 INTRODUCTION

Many errors in a computer system are committed at the bit or byte level when 
information is either transmitted along communication lines from one computer 
to another or else within a computer from the memory to the microprocessor 
or from microprocessor to input/output device. Such transfers are generally 
made over high-speed internal buses or sometimes over networks. The simplest 
technique to protect against such errors is the use of error-detecting and error- 
correcting codes. These codes are discussed in this chapter in this context. In 
Section 3.9, we see that error-correcting codes are also used in some versions 
of RAID memory storage devices. 

The reader should be familiar with the material in Appendix A and Sections 
B1-B4 before studying the material of this chapter. It is suggested that this 
material be reviewed briefly or studied along with this chapter, depending on 
the reader’s background. 

The word code has many meanings. Messages are commonly coded and
decoded to provide secret communication [Clark, 1977; Kahn, 1967], a practice
that technically is known as cryptography. The municipal rules governing
the construction of buildings are called building codes. Computer scientists
refer to individual programs and collections of programs as software, but many
physicists and engineers refer to them as computer codes. When information
in one system (numbers, alphabet, etc.) is represented by another system, we
call that other system a code for the first. Examples are the use of binary numbers
to represent numbers or the use of the ASCII code to represent the letters,
numerals, punctuation, and various control keys on a computer keyboard (see
Table C.1 in Appendix C for more information). The types of codes that we
discuss in this chapter are error-detecting and -correcting codes. The principle
that underlies error-detecting and -correcting codes is the addition of specially
computed redundant bits to a transmitted message along with added checks
on the bits of the received message. These procedures allow the detection and
sometimes the correction of a modest number of errors that occur during transmission.

The computation associated with generating the redundant bits is called coding;
that associated with detection or correction is called decoding. The use
of the words message, transmitted, and received in the preceding paragraph
reveals the origins of error codes. They were developed along with the mathematical
theory of information largely from the work of C. Shannon [1948],
who mentioned the codes developed by Hamming [1950] in his original article.
(For a summary of the theory of information and the work of the early pioneers
in coding theory, see J. R. Pierce [1980, pp. 159-163].) The preceding
use of the term transmitted bits implies that coding theory is to be applied to
digital signal transmission (or a digital model of analog signal transmission), in
which the signals are generally pulse trains representing various sequences of
0s and 1s. Thus these theories seem to apply to the field of communications;
however, they also describe information transmission in a computer system.
Clearly they apply to the signals that link computers connected by modems
and telephone lines or local area networks (LANs) composed of transceivers,
as well as coaxial wire and fiber-optic cables or wide area networks (WANs)
linking computers in distant cities. A standard model of computer architecture
views the central processing unit (CPU), the address and memory buses, the
input/output (I/O) devices, and the memory devices (integrated circuit memory
chips, disks, and tapes) as digital signal (computer word) transmission, storage,
manipulation, generation, and display devices. From this perspective, it is
easy to see how error-detecting and -correcting codes are used in the design of
modems, memory systems, disk controllers (optical, hard, or floppy), keyboards,
and printers.

The difference between error detection and error correction is based on the 
use of redundant information. It can be illustrated by the following electronic 
mail message: 

Meet me in Manhattan at the information desk at Senn Station on July 43. I will
arrive at 12 noon on the train from Philadelphia.

Clearly we can detect an error in the date, for extra information about the calendar
tells us that there is no date of July 43. Most likely the digit should be a 1
or a 2, but we can’t tell; thus the error can’t be corrected without further information.
However, just a bit of extra knowledge about New York City railroad
stations tells us that trains from Philadelphia arrive at Penn (Pennsylvania) Station
in New York City, not the Grand Central Terminal or the PATH Terminal.
Thus, Senn is not only detected as an error, but is also corrected to Penn. Note
that in all cases, error detection and correction required additional (redundant)
information. We discuss both error-detecting and error-correcting codes in the
sections that follow. We could of course send return mail to request a retransmission
of the e-mail message (again, redundant information is obtained) to
resolve the obvious transmission or typing errors.

In the preceding paragraph we discussed retransmission as a means of correcting
errors in an e-mail message. The errors were detected by a redundant
source and our knowledge of calendars and New York City railroad stations. In
general, with pulse trains we have no knowledge of “the right answer.” Thus if
we use the simple brute force redundancy technique of transmitting each pulse
sequence twice, we can compare them to detect errors. (For the moment, we
are ignoring the rare situation in which both messages are identically corrupted
and have the same wrong sequence.) We can, of course, transmit three times,
compare to detect errors, and select the pair of identical messages to provide
error correction, but we are again ignoring the possibility of identical errors
during two transmissions. These brute force methods are inefficient, as they
require many redundant bits. In this chapter, we show that in some cases the
addition of a single redundant bit will greatly improve error-detection capabilities.
Also, efficient techniques for obtaining error correction by adding more
than one redundant bit are discussed. The methods based on triple or N copies
of a message are covered in Chapter 4. The coding schemes discussed so far
rely on short “noise pulses,” which generally corrupt only one transmitted bit.
This is generally a good assumption for computer memory and address buses
and transmission lines; however, disk memories often have sequences of errors
that extend over several bits, or burst errors, and different coding schemes are
required.
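
The brute force schemes just described can be sketched in a few lines. The
following is only an illustration under simple assumptions: messages are
represented as bit strings, the function names are hypothetical, and the
corrupted word is made up for the example.

```python
from collections import Counter

def detect_by_duplication(copy_a, copy_b):
    """Two transmissions: a mismatch reveals an error but not which copy is right."""
    return copy_a != copy_b

def correct_by_triplication(copies):
    """Three transmissions: return the version that appears at least twice, else None."""
    word, votes = Counter(copies).most_common(1)[0]
    return word if votes >= 2 else None

sent = "1011001"
corrupted = "1010001"   # one bit flipped by noise (hypothetical)

print(detect_by_duplication(sent, corrupted))                 # True: error detected
print(correct_by_triplication([sent, corrupted, sent]))       # 1011001
print(correct_by_triplication([corrupted, corrupted, sent]))  # 1010001: identical errors fool the vote
```

Duplication adds 100% redundant bits and triplication adds 200%, which is the
inefficiency that the parity-bit and error-correcting codes of this chapter are
meant to avoid.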

The measure of performance we use in the case of an error-detecting code
is the probability of an undetected error, P_ue, which we of course wish to minimize.
In the case of an error-correcting code, we use the probability of transmitted
error, P_e, as a measure of performance, or the reliability, R (probability
of success), which is (1 - P_e). Of course, many of the more sophisticated coding
techniques are now feasible because advanced integrated circuits (logic and
memory) have made the costs of implementation (dollars, volume, weight, and
power) modest.

The type of code used in the design of digital devices or systems largely 
depends on the types of errors that occur, the amount of redundancy that is cost- 
effective, and the ease of building coding and decoding circuitry. The source 
of errors in computer systems can be traced to a number of causes, including 
the following: 

1. Component failure 

2. Damage to equipment 

3. “Cross-talk” on wires 

4. Lightning disturbances 
5. Power disturbances

6. Radiation effects 

7. Electromagnetic fields 

8. Various kinds of electrical noise 

Note that we can roughly classify sources 1, 2, and 3 as causes that are internal 
to the equipment; sources 4, 6, and 7 as generally external causes; and sources 5
and 8 as either internal or external. Classifying the source of the disturbance is
only useful in minimizing its strength, decreasing its frequency of occurrence, 
or changing its other characteristics to make it less disturbing to the equipment. 
The focus of this text is what to do to protect against these effects and how the 
effects can compromise performance and operation, assuming that they have 
occurred. The reader may comment that many of these error sources are rather 
rare; however, our desire for ultrareliable, long-life systems makes it important 
to consider even rare phenomena. 

The various types of interference that one can experience in practice can
be illustrated by the following two examples taken from the aircraft field.
Modern aircraft are crammed full of digital and analog electronic equipment
that is generally referred to as avionics. Several recent instances of military
crashes and civilian troubles have been noted in modern electronically controlled
aircraft. These are believed to be caused by various forms of electromagnetic
interference, such as passenger devices (e.g., cellular telephones);
“cross-talk” between various onboard systems; external signals (e.g., Voice
of America Transmitters and Military Radar); lightning; and equipment malfunction
[Shooman, 1993]. The systems affected include the following: autopilot,
engine controls, communication, navigation, and various instrumentation.
Also, a previous study by Cockpit (the pilot association of Germany) [Taylor,
1988, pp. 285-287] concluded that the number of soft fails (probably from
alpha particles and cosmic rays affecting memory chips) increased in modern
aircraft. See Table 2.1 for additional information.


TABLE 2.1 Increase of Soft Fails with Airplane Generation

                 Altitude (1,000s feet)
Airplane    ------------------------------    Total     No. of     Soft Fails
Type        Ground-5   5-20   20-30   30+     Reports   Aircraft   per a/c

B707            2        0      0       2        4      14           0.29
B727/737       11        7      2       4       24      39/28        0.36
B747           11        0      1       6       18      10           1.80
DC10           21        5      0      29       55      13           4.23
A300           96       12      6      17      131      10          13.10

Source: [Taylor, 1988]. 

It is not clear how the number of flight hours varied among the different
airplane types, what the computer memory sizes were for each of the aircraft, 
and the severity level of the fails. It would be interesting to compare this data 
to that observed in the operation of the most advanced versions of B747 and 
A320 aircraft, as well as other more recent designs. 

There has been much work done on coding theory since 1950 [Rao, 1989]. 
This chapter presents a modest sampling of theory as it applies to fault-tolerant 
systems. 


2.2 BASIC PRINCIPLES 

Coding theory can be developed in terms of the mathematical structure of
groups, subgroups, rings, fields, vector spaces, subspaces, polynomial algebra,
and Galois fields [Rao, 1989, Chapter 2]. Another simple yet effective development
of the theory based on algebra and logic is used in this text [Arazi,
1988].

2.2.1 Code Distance 

We will deal with strings of binary digits (0 or 1), which are of specified length
and are called by the following synonymous terms: binary block, binary vector,
binary word, or just code word. Suppose that we are dealing with a 3-bit message
(b1, b2, b3) represented by the bits x1, x2, x3. We can speak of the eight
combinations of these bits (see Table 2.2(a)) as the code words. In this case they
are assigned according to the sequence of binary numbers. The distance of a
code is the minimum number of bits by which any one code word differs from
another. For example, the first and second code words in Table 2.2(a) differ
only in the right-most digit and have a distance of 1, whereas the first and the
last code words differ in all 3 digits and have a distance of 3. The total number
of comparisons needed to check all of the word pairs for the minimum code
distance is the number of combinations of 8 items taken 2 at a time, C(8, 2), which
is equal to 8!/(2!6!) = 28.
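
The pairwise comparison described above is easy to carry out by machine. The
short sketch below is illustrative only (the function names are hypothetical);
it enumerates the eight 3-bit code words, counts the pairs, and finds the
minimum distance.

```python
from itertools import combinations, product

def hamming_distance(u, v):
    """Number of bit positions in which two equal-length words differ."""
    return sum(a != b for a, b in zip(u, v))

def code_distance(words):
    """Minimum Hamming distance over all pairs of distinct code words."""
    return min(hamming_distance(u, v) for u, v in combinations(words, 2))

words_3bit = ["".join(bits) for bits in product("01", repeat=3)]  # 000 ... 111
pairs = list(combinations(words_3bit, 2))

print(len(pairs))                        # 28 pairs to compare
print(hamming_distance("000", "001"))    # 1
print(hamming_distance("000", "111"))    # 3
print(code_distance(words_3bit))         # 1
```

A code distance of 1 means that some single-bit change turns one code word
into another, so this code by itself offers no error detection.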

A simpler way of visualizing the distance is to use the “cube method” of
displaying switching functions. A cube is drawn in three-dimensional space (x,
y, z), and a main diagonal goes from x = y = z = 0 to x = y = z = 1. The distance
is the number of cube edges between any two code words that represent the
vertices of the cube. Thus, the distance between 000 and 001 is a single cube
edge, but the distance between 000 and 111 is 3 since 3 edges must be traversed
to get between the two vertices. (In honor of one of the pioneers of coding
theory, the code distance is generally called the Hamming distance.) Suppose
that noise changes a single bit of a code word from 0 to 1 or 1 to 0. The
first code word in Table 2.2(a) would be changed to the second, third, or fifth,
depending on which bit was corrupted. Thus there is no way to detect a single-bit
error (or a multibit error), since any change in a code word transforms it
into another legal code word. One can create error-detecting ability in a code
by adding check bits, also called parity bits, to a code.


TABLE 2.2 Examples of 3- and 4-Bit Code Words

                    (b)
                    4-Bit Code Words:        (c)
(a)                 3 Original Bits plus     Illegal Code Words
3-Bit Code          Added Even-Parity        for the Even-Parity
Words               (Legal Code Words)       Code of (b)

x1  x2  x3          x1  x2  x3  x4           x1  x2  x3  x4
b1  b2  b3          p1  b1  b2  b3           p1  b1  b2  b3

0   0   0           0   0   0   0            1   0   0   0
0   0   1           1   0   0   1            0   0   0   1
0   1   0           1   0   1   0            0   0   1   0
0   1   1           0   0   1   1            1   0   1   1
1   0   0           1   1   0   0            0   1   0   0
1   0   1           0   1   0   1            1   1   0   1
1   1   0           0   1   1   0            1   1   1   0
1   1   1           1   1   1   1            0   1   1   1

The simplest coding scheme is to add one redundant bit. In Table 2.2(b), a
single check bit (parity bit p1) is added to the 3-bit code words b1, b2, and b3
of Table 2.2(a), creating the eight new code words shown. The scheme used
to assign values to the parity bit is the coding rule; in this case, p1 is chosen
so that the number of one bits in each word is an even number. Such a code is
called an even-parity code, and the words in Table 2.2(b) become legal code
words and those in Table 2.2(c) become illegal code words. Clearly we could
have made the number of one bits in each word an odd number, resulting in
an odd-parity code, and so the words in Table 2.2(c) would become the legal
ones and those in Table 2.2(b) become illegal.
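
The coding rule can be checked with a short sketch that regenerates the legal
and illegal word lists of Table 2.2. The helper name even_parity_bit is
hypothetical, and the words are written in the order p1, b1, b2, b3 used in
the table.

```python
from itertools import product

def even_parity_bit(bits):
    """Return p1 so that p1 plus the message bits contain an even number of ones."""
    return sum(bits) % 2

legal, illegal = [], []
for b1, b2, b3 in product((0, 1), repeat=3):
    p1 = even_parity_bit((b1, b2, b3))
    legal.append((p1, b1, b2, b3))        # Table 2.2(b): even-parity words
    illegal.append((1 - p1, b1, b2, b3))  # Table 2.2(c): parity bit inverted

print(legal[:2])    # [(0, 0, 0, 0), (1, 0, 0, 1)]
print(illegal[:2])  # [(1, 0, 0, 0), (0, 0, 0, 1)]
```

Swapping the roles of the two lists gives the odd-parity code mentioned in
the text.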

2.2.2 Check-Bit Generation and Error Detection 

The code generation rule (even parity) used to generate the parity bit in Table
2.2(b) will now be used to design a parity-bit generator circuit. We begin with
a Karnaugh map for the switching function p1(b1, b2, b3), where the parity
bit is a function of the three code bits; the map is given in Fig. 2.1(a).
The top left cell in the map corresponds
to p1 = 0 when b1, b2, and b3 = 000, whereas the top right cell represents p1
= 1 when b1, b2, and b3 = 001. These two cells represent the first two rows
of Table 2.2(b); the other cells in the map represent the other six rows in the
table. Since none of the ones in the Karnaugh map touch, no simplification is
possible, and there are four minterms in the circuit, each generated by one of
the four gates shown in the circuit. The OR gate “collects” these minterms, generating
a parity check bit p1 whenever a sequence of pulses b1, b2, and b3 occurs.
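
To make the four-minterm structure concrete, the sketch below writes the
parity bit as a sum of products and compares it with the exclusive-OR of the
three message bits. The minterms 001, 010, 100, and 111 are read from the rows
of Table 2.2(b) in which p1 = 1; the function name is hypothetical.

```python
from itertools import product

def p1_sum_of_products(b1, b2, b3):
    """Four AND terms (minterms) collected by an OR gate, as in a two-level circuit."""
    n1, n2, n3 = 1 - b1, 1 - b2, 1 - b3   # complemented inputs
    minterms = (
        n1 & n2 & b3,    # 001
        n1 & b2 & n3,    # 010
        b1 & n2 & n3,    # 100
        b1 & b2 & b3,    # 111
    )
    return int(any(minterms))

for b1, b2, b3 in product((0, 1), repeat=3):
    assert p1_sum_of_products(b1, b2, b3) == b1 ^ b2 ^ b3
print("sum-of-products form matches b1 XOR b2 XOR b3 for all 8 inputs")
```

The agreement with the exclusive-OR function is what makes a simpler gate
realization possible.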

[Figure 2.1: (a) Karnaugh map and circuit for parity-bit generation. (b) Circuit
and Karnaugh map for error detection. The error-detection map is a checkerboard,
with rows and columns in the order 00, 01, 11, 10:

         00  01  11  10
    00    1   0   1   0
    01    0   1   0   1
    11    1   0   1   0
    10    0   1   0   1
]

Figure 2.1 Elementary parity-bit coding and decoding circuits. (a) Generation of an
even-parity bit for a 3-bit code word. (b) Detection of an error for an even-parity-bit
code for a 3-bit code word.

The addition of the parity bit creates a set of legal and illegal words; thus
we can detect an error if we check for legal or illegal words. In Fig. 2.1(b) the 
Karnaugh map displays ones for legal code words and zeroes for illegal code 
words. Again, there is no simplification since all the minterms are separated, 
so the error detector circuit can be composed by generating all the illegal word 
minterms (indicated by zeroes) in Fig. 2.1(b) using eight AND gates followed 
by an 8-input OR gate as shown in the figure. The circuits derived in Fig. 
2.1 can be simplified by using exclusive or (EXOR) gates (as shown in the 
next section); however, we have demonstrated in Fig. 2.1 how check bits can 
be generated and how errors can be detected. Note that parity checking will 
detect errors that occur in either the message bits or the parity bit. 
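
The detection property, and its limits, can be verified exhaustively for the
4-bit even-parity code. The following sketch is illustrative (helper names are
hypothetical); it flips one or two bits of every legal word and tests whether
the result is still legal.

```python
from itertools import combinations, product

def is_legal(word):
    """Even-parity check: a legal word has an even number of ones."""
    return sum(word) % 2 == 0

def flip(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    return tuple(bit ^ 1 if i in positions else bit for i, bit in enumerate(word))

legal_words = [w for w in product((0, 1), repeat=4) if is_legal(w)]

# Every single-bit error (in p1 or in a message bit) yields an illegal word.
single_detected = all(not is_legal(flip(w, {i}))
                      for w in legal_words for i in range(4))

# Every double-bit error yields another legal word and so goes undetected.
double_missed = all(is_legal(flip(w, set(pair)))
                    for w in legal_words for pair in combinations(range(4), 2))

print(len(legal_words), single_detected, double_missed)   # 8 True True
```

In general, a single parity bit detects any odd number of bit errors and
misses any even number.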


2.3 PARITY-BIT CODES 

2.3.1 Applications 

Three important applications of parity-bit error-checking codes are as follows: 

1. The transmission of characters over telephone lines (or optical, microwave,
radio, or satellite links). The best known application is the use of
a modem to allow computers to communicate over telephone lines.

2. The transmission of data to and from electronic memory (memory read 
and write operations). 

3. The exchange of data between units within a computer via various data 
and control buses. 

Specific implementation details may differ among these three applications, but
the basic concepts and circuitry are very similar. We will discuss the first application
and use it as an illustration of the basic concepts.

2.3.2 Use of Exclusive OR Gates 

This section will discuss how an additional bit can be added to a byte for error
detection. It is common to represent alphanumeric characters in the input and
output phases of computation by a single byte. The ASCII code is almost universally
used. One technique uses the entire byte to represent 2^8 = 256 possible
characters (the extended character set that is used on IBM personal computers,
containing some Greek letters, language accent marks, graphic characters, and
so forth), as well as an additional ninth parity bit. The other approach limits
the character set to 128, which can be expressed by seven bits, and uses the
eighth bit for parity.

Suppose we wish to build a parity-bit generator and code checker for the
case of seven message bits and one parity bit. Identifying the minterms will
reveal a generalization of the checkerboard diagram similar to that given in the
Karnaugh maps of Fig. 2.1. Such checkerboard patterns indicate that EXOR
gates can be used to simplify the circuit. A circuit using EXOR gates for parity-bit
generation and for checking of an 8-bit byte is given in Fig. 2.2. Note that
the circuit in Fig. 2.2(a) contains a control input that allows one to easily switch
from even to odd parity. Similarly, the addition of the NOT gate (inverter) at
the output of the checking circuit allows one to use either even or odd parity.
Most modems have these refinements, and a switch chooses either even or odd
parity.

[Figure 2.2: (a) Parity-bit encoder (generator): the inputs are the seven message
bits and a control input (1 = odd parity, 0 = even parity); the output is
p1 = b1 ⊕ b2 ⊕ b3 ⊕ b4 ⊕ b5 ⊕ b6 ⊕ b7. (b) Parity-bit decoder (checker): the
inputs are the received bits; there are separate outputs for even parity and
odd parity, each with 1 = error and 0 = OK.]

Figure 2.2 Parity-bit encoder and decoder for a transmitted byte: (a) A 7-bit parity
encoder (generator); (b) an 8-bit parity decoder (checker).
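
The behavior of the EXOR encoder and checker can be imitated in software. The
sketch below is a functional model only, not a gate-level description of Fig.
2.2; the function names, the odd argument (standing in for the even/odd control
input), and the bit ordering of the example character are assumptions made for
illustration.

```python
from functools import reduce
from operator import xor

def parity_encode(message_bits, odd=0):
    """XOR of the 7 message bits and the control input (0 = even, 1 = odd parity)."""
    return reduce(xor, message_bits, odd)

def parity_check(received_bits, odd=0):
    """Return 1 if the received 8-bit word violates the chosen parity, else 0."""
    syndrome = reduce(xor, received_bits, 0)   # 1 when the word has an odd number of ones
    return syndrome ^ odd                      # output is inverted for odd parity

message = [1, 0, 0, 0, 0, 0, 1]               # 7-bit ASCII 'A' (1000001), bit order assumed
word = message + [parity_encode(message)]      # even parity: parity bit is 0
print(word, parity_check(word))                # [1, 0, 0, 0, 0, 0, 1, 0] 0

word[3] ^= 1                                   # single-bit transmission error
print(parity_check(word))                      # 1
```

Because exclusive-OR is associative, the same result is obtained whether the
bits are combined serially, as here, or in a tree of two-input gates.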


2.3.3 Reduction in Undetected Errors 

The purpose of parity-bit checking is to detect errors. The extent to which
such errors are detected is a measure of the success of the code, whereas the
probability of not detecting an error, P_ue, is a measure of failure. In this section
we analyze how parity-bit coding decreases P_ue. We include in this analysis
the reliability of the parity-bit coding and decoding circuit by analyzing the
reliability of a standard IC parity code generator/checker. We model the failure
of the IC chip in a simple manner by assuming that it fails to detect errors, and
we ignore the possibility that errors are detected when they are not present.

Let us consider the addition of a ninth parity bit to an 8-bit message byte. The
parity bit adjusts the number of ones in the word to an even (odd) number and
is computed by a parity-bit generator circuit that calculates the EXOR function
of the 8 message bits. Similarly, an EXOR-detecting circuit is used to check for
transmission errors. If 1, 3, 5, 7, or 9 errors are found in the received word, the
parity is violated, and the checking circuit will detect an error. This can lead to
several consequences, including “flagging” the error byte and retransmission of
the byte until no errors are detected. The probability of interest is the probability
of an undetected error, P'_ue, which is the probability of 2, 4, 6, or 8 errors, since
these combinations do not violate the parity check. These probabilities can be
calculated by simply using the binomial distribution (see Appendix A5.3). The
probability of r failures in n occurrences with failure probability q is given by the
binomial probability B(r: n, q). Specifically, n = 9 (the number of bits) and q = the
probability of an error per transmitted bit; thus

General:

    B(r:9,q) = \binom{9}{r} q^{r} (1-q)^{9-r}        (2.1)

Two errors:

    B(2:9,q) = \binom{9}{2} q^{2} (1-q)^{9-2}        (2.2)

Four errors:

    B(4:9,q) = \binom{9}{4} q^{4} (1-q)^{9-4}        (2.3)

and so on.

For q relatively small (10^-4), it is easy to see that Eq. (2.3) is much smaller
than Eq. (2.2); thus only Eq. (2.2) needs to be considered (probabilities for r
= 4, 6, and 8 are negligible), and the probability of an undetected error with
parity-bit coding becomes

    P'_{ue} = B(2:9,q) = 36 q^{2} (1-q)^{7}        (2.4)

We wish to compare this with the probability of an undetected error for an 8-bit
transmission without any checking. With no checking, all errors are undetected;
thus we must compute B(1:8,q) + ... + B(8:8,q), but it is easier to compute

    P_{ue} = 1 - P(0 errors) = 1 - B(0:8,q) = 1 - \binom{8}{0} q^{0} (1-q)^{8-0}
           = 1 - (1-q)^{8}        (2.5)

Note that our convention is to use P_ue for the case of no checking, and P'_ue for
the case of checking.

The ratio of Eqs. (2.5) and (2.4) yields the improvement ratio due to the
parity-bit coding as follows:

    P_{ue}/P'_{ue} = [1 - (1-q)^{8}] / [36 q^{2} (1-q)^{7}]        (2.6)

For small q we can simplify Eq. (2.6) by replacing (1 ± q)^n by 1 ± nq and
[1/(1 - q)] by 1 + q, which yields

    P_{ue}/P'_{ue} = 2(1 + 7q)/(9q)        (2.7)

The parameter q, the probability of failure per bit transmitted, is quoted as
10^-4 in Hill and Peterson [1981]. The failure probability q was 10^-5 or 10^-6
in the 1960s and ’70s; now, it may be as low as 10^-7 for the best telephone
lines [Rubin, 1990]. Equation (2.7) is evaluated for the range of q values; the
results appear in Table 2.3 and in Fig. 2.3.

The improvement ratio is quite significant, and the overhead (1 parity bit
added to 8 message bits) is only 12.5%, which is quite modest. This probably
explains why a parity-bit code is so frequently used.
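
As a numerical check on Eqs. (2.6) and (2.7), the short Python sketch below
evaluates both expressions over the range of q used in Table 2.3. It is an
illustration rather than part of the original analysis; the function names are
assumptions chosen here, and `expm1`/`log1p` are used only to avoid round-off
when q is very small.

```python
import math

def undetected_no_check(q, bits=8):
    """Eq. (2.5): probability of at least one error in an unchecked byte."""
    return -math.expm1(bits * math.log1p(-q))

def undetected_parity(q):
    """Eq. (2.4): dominant (two-error) term for 8 data bits plus 1 parity bit."""
    return math.comb(9, 2) * q**2 * (1 - q)**7

def ratio_exact(q):
    """Eq. (2.6)."""
    return undetected_no_check(q) / undetected_parity(q)

def ratio_approx(q):
    """Eq. (2.7), the small-q simplification."""
    return 2 * (1 + 7 * q) / (9 * q)

for exp in range(4, 9):
    q = 10.0 ** -exp
    print(f"q = 1e-{exp}: Eq. (2.6) = {ratio_exact(q):.4e}, "
          f"Eq. (2.7) = {ratio_approx(q):.4e}")
```

For the values of q considered, the two expressions agree to about three
significant figures, which is why the simpler Eq. (2.7) suffices for the table.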

In the above analysis we assumed that the coder and decoder are perfect. We
now examine the validity of that assumption by modeling the reliability of the
coder and decoder. One could use a design similar to that of Fig. 2.2; however,
it is more realistic to assume that we are using a commercial circuit device: the
SN74180, a 9-bit odd/even parity generator/checker (see Texas Instruments
[1988]), or the newer 74LS280 [Motorola, 1992]. The SN74180 has an equivalent
circuit (see Fig. 2.4), which has 14 gates and inverters, whereas the
pin-compatible 74LS280 with improved performance has 46 gates and inverters in
its equivalent circuit. Current prices of the SN74180 and the similar 74LS280
ICs are about 10-75 cents each, depending on logic family and order quantity.
We will use two such devices since the same chip can be used as a coder and
a decoder (generator/checker). The logic diagram of this device is shown in
Fig. 2.4.

TABLE 2.3 Evaluation of the Reduction in Undetected
Errors from Parity-Bit Coding: Eq. (2.7)

Bit Error Probability, q      Improvement Ratio: P_ue/P'_ue
10^-4                         2.223 x 10^3
10^-5                         2.222 x 10^4
10^-6                         2.222 x 10^5
10^-7                         2.222 x 10^6
10^-8                         2.222 x 10^7



Figure 2.3 Improvement ratio of undetected error probability from parity-bit coding.
(Horizontal axis: bit error probability, q.)

Figure 2.4 Logic diagram for SN74180 [Texas Instruments, 1988, used with permission].

2.3.4 Effect of Coder-Decoder Failures

An approximate model for IC reliability is given in Appendix B3.3, Fig. B7.
The model assumes the failure rate of an integrated circuit is proportional to
the square root of the number of gates, $g$, in the equivalent logic model. Thus
the failure rate per million hours is given as $\lambda_b = C(g)^{1/2}$, where
$C$ was computed from 1985 IC failure-rate data as 0.004. We can use this model
to estimate the failure rate and subsequently the reliability of an IC parity
generator/checker. In the equivalent gate model for the SN74180 given in
Fig. 2.4, there are 5 EXNOR, 2 EXOR, 1 NOT, 4 AND, and 2 NOR gates. Note that
the output gates (5) and (6) are NOR rather than OR gates. Sometimes, for good
and proper reasons, integrated circuit designers use equivalent logic using
different gates. Assuming the 2 EXOR and 5 EXNOR gates use about 1.5 times
as many transistors to realize their function as the other gates, we consider
them as equivalent to 10.5 gates. Thus we have 17.5 equivalent gates and
$\lambda_b = 0.004(17.5)^{1/2}$ failures per million hours
$= 1.67 \times 10^{-8}$ failures per hour.
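
The arithmetic above can be sketched in a few lines of Python. The gate
inventory and the 1.5 weighting are taken from the text; the dictionary layout
and the function name are assumptions made for this illustration.

```python
import math

C = 0.004  # failures per million hours, fitted to 1985 IC data (from the text)

def ic_failure_rate_per_hour(equivalent_gates, c=C):
    """lambda_b = C * sqrt(g), converted from per-million-hours to per-hour."""
    return c * math.sqrt(equivalent_gates) * 1e-6

# SN74180 gate inventory from Fig. 2.4; EXOR/EXNOR weighted 1.5x as in the text.
gates = {"EXNOR": 5, "EXOR": 2, "NOT": 1, "AND": 4, "NOR": 2}
weights = {"EXNOR": 1.5, "EXOR": 1.5}
g = sum(n * weights.get(kind, 1.0) for kind, n in gates.items())

print(g)                                     # 17.5 equivalent gates
print(f"{ic_failure_rate_per_hour(g):.3e}")  # about 1.67e-08 failures per hour
```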

In formulating a reliability model for a parity-bit coder-decoder scheme, we
must consider two modes of failure for the coded word: A, where the coder and
decoder do not fail but the number of bit errors is an even number equal to 2
or more; and B, where the coder or decoder chip fails. We ignore chip failure
modes that sometimes give correct results. Since A and B are mutually
exclusive, the probability of undetected error with the coding scheme is given by

$$P'_{ue} = P(A + B) = P(A) + P(B) \tag{2.8}$$

In Eq. (2.8), the chip failure rates are per hour; thus we write Eq. (2.8) as

$$P'_{ue} = P[\text{no coder or decoder failure during 1 byte transmission}] \times P[\text{2 or more errors}] + P[\text{coder or decoder failure during 1 byte transmission}] \tag{2.9}$$

If we let $B$ be the bit transmission rate per second, then the number of
seconds to transmit a bit is $1/B$. Since a byte plus parity is 9 bits, it will
take $9/B$ seconds, or $9/(3{,}600B)$ hours, to transmit the 9 bits.

If we assume a constant failure rate $\lambda_b$ for the coder and decoder, the
reliability of a coder-decoder pair is $e^{-2\lambda_b t}$ and the probability
of coder or decoder failure is $(1 - e^{-2\lambda_b t})$. The probability of 2
or more errors in a transmitted byte is approximated by Eq. (2.4); thus
Eq. (2.9) becomes

$$P'_{ue} = e^{-2\lambda_b t} \times 36q^{2}(1-q)^{7} + (1 - e^{-2\lambda_b t}) \tag{2.10}$$


where

$$t = 9/(3{,}600B) \tag{2.11}$$

TABLE 2.4 The Reduction in Undetected Errors from Parity-Bit Coding
Including the Effect of Coder-Decoder Failures

                 Improvement Ratio: P_ue/P'_ue for Several Transmission Rates
Bit Error
Probability, q   300 Bits/Sec    1,200 Bits/Sec   9,600 Bits/Sec   56,000 Bits/Sec
10^-4            2.223 x 10^3    2.223 x 10^3     2.223 x 10^3     2.223 x 10^3
10^-5            2.222 x 10^4    2.222 x 10^4     2.222 x 10^4     2.222 x 10^4
10^-6            2.228 x 10^5    2.218 x 10^5     2.222 x 10^5     2.222 x 10^5
10^-7            1.254 x 10^6    1.962 x 10^6     2.170 x 10^6     2.213 x 10^6
5 x 10^-8        1.087 x 10^6    2.507 x 10^6     4.053 x 10^6     4.372 x 10^6
10^-8            2.841 x 10^5    1.093 x 10^6     6.505 x 10^6     1.577 x 10^7

The undetected error probability with no coding is given by Eq. (2.5) and
is independent of time:

$$P_{ue} = 1 - (1-q)^{8} \tag{2.12}$$

Clearly, if the failure rate is small or the bit rate $B$ is large, then
$e^{-2\lambda_b t} \approx 1$, the failure probabilities of the coder-decoder
chips are insignificant, and the ratio of Eq. (2.12) and Eq. (2.10) reduces to
Eq. (2.7). If we are using a parity code for memory bit checking, the bit rate
is essentially set by the memory cycle time (assuming a long succession of
memory operations), and the effect of chip failures is negligible. However, in
the case of parity-bit coding in a modem, the baud rate will be lower and chip
failures can be significant, especially in the case where $q$ is small. The
ratio of Eq. (2.12) to Eq. (2.10) is evaluated in Table 2.4 (and plotted in
Fig. 2.5) for typical modem bit rates $B$ = 300, 1,200, 9,600, and 56,000.
Note that the chip failure rate is insignificant for $q = 10^{-4}$, $10^{-5}$,
and $10^{-6}$; however, it does make a difference for $q = 10^{-7}$ and
$10^{-8}$. If the bit rate $B$ is infinite, the effect of chip failure
disappears, and we can view Table 2.3 as depicting this case.
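
The interplay between q and B in Eqs. (2.10) to (2.12) is easy to explore
numerically. The following Python sketch is an illustration under the
assumptions of the text (two identical chips, each with the failure rate
derived above); the function names are chosen here and are not from the
original. It comes close to the entries of Table 2.4 (for example, about
1.25 x 10^6 at q = 10^-7 and 300 bits/sec), although individual entries may
differ in the last digits.

```python
import math

LAMBDA_B = 1.67e-8  # failures per hour for one parity chip (value derived in the text)

def p_ue_unchecked(q):
    """Eq. (2.12): 1 - (1 - q)^8."""
    return -math.expm1(8 * math.log1p(-q))

def p_ue_parity(q, bit_rate, lam=LAMBDA_B):
    """Eq. (2.10), with t from Eq. (2.11) in hours."""
    t = 9 / (3600 * bit_rate)
    p_chip_fail = -math.expm1(-2 * lam * t)   # 1 - exp(-2*lambda_b*t)
    p_two_errors = 36 * q**2 * (1 - q)**7     # Eq. (2.4)
    return (1 - p_chip_fail) * p_two_errors + p_chip_fail

def improvement(q, bit_rate):
    """Ratio of Eq. (2.12) to Eq. (2.10)."""
    return p_ue_unchecked(q) / p_ue_parity(q, bit_rate)

for q in (1e-4, 1e-5, 1e-6, 1e-7, 5e-8, 1e-8):
    row = "  ".join(f"{improvement(q, b):.3e}" for b in (300, 1200, 9600, 56000))
    print(f"{q:.0e}  {row}")
```

At q = 10^-4 the chip term is many orders of magnitude below the two-error
term, so every column collapses to the value given by Eq. (2.7).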

2.4 HAMMING CODES 
2.4.1 Introduction 

In this section, we develop a class of codes created by Richard Hamming
[1950], for whom they are named. These codes will employ c check bits to
detect more than a single error in a coded word, and if enough check bits are
used, some of these errors can be corrected. The relationships among the number
of check bits and the number of errors that can be detected and corrected
are developed in the following section. It will not be surprising that the case
in which c = 1 results in a code that can detect single errors but cannot correct
errors; this is the parity-bit code that we have just discussed.

Figure 2.5 Improvement ratio of undetected error probability from parity-bit coding
(including the possibility of coder-decoder failure). B is the transmission rate in bits
per second. (Horizontal axis: bit error probability, q.)


2.4.2 Error-Detection and -Correction Capabilities 

We defined the concept of Hamming distance of a code in the previous section.
Now, we establish the error-detecting and -correcting abilities of a code based
on its Hamming distance. The following results apply to linear codes, in which
the difference and sum of any two code words (addition and subtraction
of their binary representations) are also code words. Most of this chapter will
deal with linear codes. The following notations are used in this chapter:

d = the Hamming distance of a code (2.13)

D = the number of errors that a code can detect (2.14a)

C = the number of errors that a code can correct (2.14b)

n = the total number of bits in the coded word (2.15a)

m = the number of message or information bits (2.15b)

c = the number of check (parity) bits (2.15c)

where d, D, C, n, m, and c are all integers $\ge 0$.

As we said previously, the model we will use is one in which the check bits
are added to the message bits by the coder. The message is then "transmitted,"
and the decoder checks for any detectable errors. If there are enough check bits,
and if the circuit is so designed, some of the errors are corrected. Initially, one
can view the error-detection process as a check of each received word to see
if the word belongs to the illegal set of words. Any set of errors that converts a
legal code word into an illegal one is detected by this process, whereas errors
that change a legal code word into another legal code word are not detected.
To detect D errors, the Hamming distance must be at least one larger than D:

$$d \ge D + 1 \tag{2.16}$$

This relationship must be so because a single error in a code word produces a 
new word that is a distance of one from the transmitted word. However, if the 
code has a basic distance of one, this error results in a new word that belongs 
to the legal set of code words. Thus for this single error to be detectable, the 
code must have a basic distance of two so that the new word produced by 
the error does not belong to the legal set and therefore must correspond to 
the detectable illegal set. Similarly, we could argue that a code that can detect 
two errors must have a Hamming distance of three. By using induction, one 
establishes that Eq. (2.16) is true. 
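
To make Eq. (2.16) concrete, the Python sketch below computes the distance of a
small code by exhaustive pairwise comparison and reports D = d - 1. The 4-bit
even-parity code used as input, with the parity bit written first, is an
assumed layout chosen for illustration, as are the function names.

```python
from itertools import combinations

def hamming_distance(a, b):
    """Number of bit positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def code_distance(code_words):
    """Minimum distance d over all pairs of distinct legal code words."""
    return min(hamming_distance(a, b) for a, b in combinations(code_words, 2))

# Even-parity code over 3 message bits: parity bit followed by the message bits.
parity_code = []
for m in range(8):
    msg = format(m, "03b")
    parity_code.append(str(msg.count("1") % 2) + msg)

d = code_distance(parity_code)
print(d, d - 1)   # distance d and detectable errors D = d - 1, Eq. (2.16)
```

The result, d = 2 and D = 1, is the single-error-detecting behavior expected of
a parity-bit code.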

We now discuss the process of error correction. First, we note that to correct
an error we must be able to detect that an error has occurred. Suppose we
consider the parity-bit code of Table 2.2. From Eq. (2.16) we know that
$d \ge 2$ for error detection; in fact, d = 2 for the parity-bit code, which
means that we have a set of legal code words that are separated by a Hamming
distance of at least two. A single bit error creates an illegal code word that
is a distance of one from more than 1 legal code word; thus we cannot correct
the error by seeking the closest legal code word. For example, consider the
legal code word 0000 in Table 2.2(b). Suppose that the last bit is changed to a
one, yielding 0001, which is the second illegal code word in Table 2.2(c).
Unfortunately, the distance from that illegal word to each of the eight legal
code words is 1, 1, 3, 1, 3, 1, 3, and 3 (respectively). Thus there is a
four-way tie for the closest legal code word. Obviously we need a larger
Hamming distance for error correction. Consider the number line representing
the distance between any 2 legal code words for the case of d = 3 shown in
Fig. 2.6(a). In this case, if there is 1 error, we move 1 unit to the right
from word a toward word b. We are still 2 units away from word b and at least
that far away from any other word, so we can recognize word a as the closest
and select it as the correct word. We can generalize this principle by
examining Fig. 2.6(b). If there are C errors to correct, we have moved a
distance of C away from code word a; to have this word closer than any other
word, we must have at least a distance of C + 1 from the erroneous code word to
the nearest other legal code word so we can correct the errors. This gives rise
to the formula for the number of errors that can be corrected with a Hamming
distance of d, as follows:

Figure 2.6 Number lines representing the distances between two legal code words:
(a) words a and b separated by a distance of 3; (b) word a corrupted by C errors,
at distance C from word a and distance C + 1 from word b.

$$d \ge 2C + 1 \tag{2.17}$$
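
The four-way tie described above, and the bound of Eq. (2.17), can be checked
with a few lines of Python. This is a sketch: the listed code words assume the
even-parity code with the parity bit written first, and the helper names are
chosen here for illustration.

```python
def hamming_distance(a, b):
    """Number of positions in which two equal-length bit strings differ."""
    return sum(x != y for x, y in zip(a, b))

def nearest_code_words(received, code_words):
    """Return (minimum distance, all legal words at that distance)."""
    best = min(hamming_distance(received, w) for w in code_words)
    return best, [w for w in code_words if hamming_distance(received, w) == best]

def correctable_errors(d):
    """Largest C satisfying d >= 2C + 1, Eq. (2.17)."""
    return (d - 1) // 2

# Assumed layout of the even-parity code: parity bit first, then 3 message bits.
parity_code = ["0000", "1001", "1010", "0011", "1100", "0101", "0110", "1111"]

print(nearest_code_words("0001", parity_code))        # four-way tie at distance 1
print([correctable_errors(d) for d in (2, 3, 4, 5)])  # [0, 1, 1, 2]
```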

Inspecting Eqs. (2.16) and (2.17) shows that for the same value of d,

$$D \ge C \tag{2.18}$$

We can combine Eqs. (2.17) and (2.18) by rewriting Eq. (2.17) as

$$d \ge C + C + 1 \tag{2.19}$$

If we use the smallest value of D from Eq. (2.18), that is, D = C, and
substitute for one of the Cs in Eq. (2.19), we obtain

$$d \ge D + C + 1 \tag{2.20}$$

which summarizes and combines Eqs. (2.16) to (2.18).

One can develop the entire class of Hamming codes by solving Eq. (2.20),
remembering that $D \ge C$ and that d, D, and C are integers $\ge 0$. For
d = 1, D = C = 0, and no code is possible; for d = 2, D = 1, C = 0, and we have
the parity-bit code. The class of codes governed by Eq. (2.20) is given in
Table 2.5.

The most popular codes are the parity code; the d = 3, D = C = 1 code, generally
called a single error-correcting and single error-detecting (SECSED) code; and
the d = 4, D = 2, C = 1 code, generally called a single error-correcting and
double error-detecting (SECDED) code.
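
The integer solutions of Eq. (2.20) can be enumerated mechanically. The sketch
below lists, for each distance d, the (D, C) pairs that use the full distance
(D + C + 1 = d) subject to D >= C >= 0; the function name is an assumption made
for this illustration.

```python
def code_classes(d):
    """All (D, C) pairs with D + C + 1 = d and D >= C >= 0, per Eq. (2.20)."""
    return [(d - 1 - c, c) for c in range(0, (d - 1) // 2 + 1)]

for d in range(1, 7):
    for D, C in code_classes(d):
        print(f"d = {d}: detects {D}, corrects {C}")
```

For d = 1 through 6 this produces twelve combinations, the same set that
Table 2.5 tabulates.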

2.4.3 The Hamming SECSED Code 

The Hamming SECSED code has a distance of 3, and corrects and detects 1 
error. It can also be used as a double error-detecting code (DED). 

TABLE 2.5 Relationships Among d, D, and C

d    D    C    Type of Code
1    0    0    No code possible
2    1    0    Parity bit
3    1    1    Single error detecting; single error correcting
3    2    0    Double error detecting; zero error correcting
4    3    0    Triple error detecting; zero error correcting
4    2    1    Double error detecting; single error correcting
5    4    0    Quadruple error detecting; zero error correcting
5    3    1    Triple error detecting; single error correcting
5    2    2    Double error detecting; double error correcting
6    5    0    Quintuple error detecting; zero error correcting
6    4    1    Quadruple error detecting; single error correcting
6    3    2    Triple error detecting; double error correcting
etc.

Consider a Hamming SECSED code with 4 message bits ($b_1$, $b_2$, $b_3$, and
$b_4$) and 3 check bits ($c_1$, $c_2$, and $c_3$) that are computed from the
message bits by equations integral to the code design. Thus we are dealing with
a 7-bit word. A brute force detection-correction algorithm would be to compare
the coded word in question with all the $2^7 = 128$ code words. No error is
detected if the coded word matches any of the $2^4 = 16$ legal combinations of
message bits. No detected errors means either that none have occurred or that
too many errors have occurred (the code is not powerful enough to detect so
many errors). If we detect an error, we compute the distance between the
illegal code word and the 16 legal code words and effect error correction by
choosing the code word that is closest. Of course, this can be done in one step
by computing the distance between the coded word and all 16 legal code words.
If one distance is 0, no errors are detected; otherwise the minimum distance
points to the corrected word.
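
The brute force procedure can be sketched in Python as follows. To have a
concrete set of 16 legal words, the sketch borrows the check-bit equations of
Eqs. (2.22a-c) and the bit ordering of Table 2.6; the function names are
assumptions made for illustration.

```python
from itertools import product

def encode(b1, b2, b3, b4):
    """(7, 4) code word c1 c2 b1 c3 b2 b3 b4, check bits per Eqs. (2.22a-c)."""
    c1 = b1 ^ b2 ^ b4
    c2 = b1 ^ b3 ^ b4
    c3 = b2 ^ b3 ^ b4
    return (c1, c2, b1, c3, b2, b3, b4)

LEGAL = [encode(*bits) for bits in product((0, 1), repeat=4)]  # 16 legal words

def distance(a, b):
    """Hamming distance between two equal-length bit tuples."""
    return sum(x != y for x, y in zip(a, b))

def brute_force_decode(word):
    """Return (minimum distance, closest legal word); 0 means no error detected."""
    return min((distance(word, w), w) for w in LEGAL)

received = (1, 0, 1, 1, 0, 0, 0)      # 1011010 with position x6 flipped
print(brute_force_decode(received))   # (1, (1, 0, 1, 1, 0, 1, 0))
```

Sixteen distance computations per word are tolerable for a 7-bit code, but the
count doubles with every added message bit, which motivates the more efficient
scheme described next.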

The information in Table 2.5 just tells us the possibilities in constructing a
code; it does not tell us how to construct the code. Hamming [1950] devised a
scheme for coding and decoding a SECSED code in his original work. Check
bits are interspersed in the code word in bit positions that correspond to powers
of 2. Word positions that are not occupied by check bits are filled with message
bits. The length of the coded word is n bits, composed of c check bits added to
m message bits. The common notation is to denote the code word (also called
binary word, binary block, or binary vector) as (n, m). As an example, consider
a (7, 4) code word. The 3 check bits and 4 message bits are located as shown
in Table 2.6.

TABLE 2.6 Bit Positions for Hamming SECSED (d = 3) Code

Bit positions   x1   x2   x3   x4   x5   x6   x7
Check bits      c1   c2   -    c3   -    -    -
Message bits    -    -    b1   -    b2   b3   b4

TABLE 2.7 Relationships Among n, c, and m for a SECSED
Hamming Code

Length, n    Check Bits, c    Message Bits, m
 1           1                 0
 2           2                 0
 3           2                 1
 4           3                 1
 5           3                 2
 6           3                 3
 7           3                 4
 8           4                 4
 9           4                 5
10           4                 6
11           4                 7
12           4                 8
13           4                 9
14           4                10
15           4                11
16           5                11
etc.




In the code shown, the 3 check bits are sufficient for codes with 1 to 4
message bits. If there were another message bit, it would occupy position $x_9$,
and position $x_8$ would be occupied by a fourth check bit. In general, c check
bits will cover a maximum of $(2^{c} - 1)$ word bits, or $2^{c} \ge n + 1$.
Since n = c + m, we can write

$$2^{c} \ge [c + m + 1] \tag{2.21}$$

where the notation [c + m + 1] means the smallest integer value of c that
satisfies the relationship. One can solve Eq. (2.21) by assuming a value of n
and computing the number of message bits that the various values of c can
check. (See Table 2.7.)

If we examine the entry in Table 2.7 for a message that is 1 byte long, m
= 8, we see that 4 check bits are needed and the total word length is 12 bits.
Thus we can say that the ratio c/m is a measure of the code overhead, which
in this case is 50%. The overhead for common computer word lengths, m, is
given in Table 2.8.

Clearly the overhead approaches 10% for long word lengths. Of course, one
should remember that these codes are competing for efficiency with the
parity-bit code, in which 1 check bit represents only a 1.6% overhead for a
64-bit word length.
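
Equation (2.21) is simple to solve by direct search. The Python sketch below
finds the smallest c for a given message length m and prints the resulting
code length and overhead for the common word lengths; the function name is an
assumption made for this illustration.

```python
def check_bits_needed(m):
    """Smallest c with 2**c >= c + m + 1, Eq. (2.21)."""
    c = 1
    while 2**c < c + m + 1:
        c += 1
    return c

for m in (8, 16, 32, 48, 64):
    c = check_bits_needed(m)
    print(f"m = {m:2d}  c = {c}  n = {m + c:2d}  overhead = {100 * c / m:.1f}%")
```

Because c grows only logarithmically with m, the overhead falls steadily as the
word length increases.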

TABLE 2.8 Overhead for Various Word Lengths (m) for a Hamming
SECSED Code

Code Length,    Word (Message)    Number of Check    Overhead
n               Length, m         Bits, c            (c/m) x 100%
12               8                4                  50
21              16                5                  31
38              32                6                  19
54              48                6                  13
71              64                7                  11

We now return to our (7, 4) SECSED code example to explain how the
check bits are generated. Hamming developed a much more ingenious and
efficient design and method for detection and correction. The Hamming code
positions for the check and message bits are given in Table 2.6, which yields
the code word $c_1c_2b_1c_3b_2b_3b_4$. The check bits are calculated by
computing the exclusive or, $\oplus$, of 3 appropriate message bits as shown in
the following equations:

$$c_1 = b_1 \oplus b_2 \oplus b_4 \tag{2.22a}$$

$$c_2 = b_1 \oplus b_3 \oplus b_4 \tag{2.22b}$$

$$c_3 = b_2 \oplus b_3 \oplus b_4 \tag{2.22c}$$

Such a choice of check bits forms an obvious pattern if we write the 3
check equations below the word we are checking, as is shown in Table 2.9.
Each parity bit and message bit present in Eqs. (2.22a-c) is indicated by a
"1" in the respective rows (all other positions are 0). If we read down in each
column, the last 3 bits are the binary number corresponding to the bit position
in the word.

Clearly, the binary number pattern gives us a design procedure for constructing
parity check equations for distance 3 codes of other word lengths. Reading
across rows 3-5 of Table 2.9, we see that the check bit with a 1 is on the left
side of the equation and all other bits appear as $\oplus$ terms on the
right-hand side.

TABLE 2.9 Pattern of Parity Check Bits for a Hamming (7, 4) SECSED Code

Bit positions in word   x1   x2   x3   x4   x5   x6   x7
Code word               c1   c2   b1   c3   b2   b3   b4
Check bit c1            1    0    1    0    1    0    1
Check bit c2            0    1    1    0    0    1    1
Check bit c3            0    0    0    1    1    1    1

As an example, consider that the message bits $b_1b_2b_3b_4$ are 1010, in which
case the check bits are

$$c_1 = 1 \oplus 0 \oplus 0 = 1 \tag{2.23a}$$

$$c_2 = 1 \oplus 1 \oplus 0 = 0 \tag{2.23b}$$

$$c_3 = 0 \oplus 1 \oplus 0 = 1 \tag{2.23c}$$

and the code word is $c_1c_2b_1c_3b_2b_3b_4 = 1011010$.
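
The binary-position pattern of Table 2.9 leads directly to a general encoder.
The Python sketch below is one way to write it, under the assumption that the
check bit at position 2^k covers every other position whose binary address has
bit k set; the function name is chosen here for illustration.

```python
def hamming_encode(message_bits):
    """Place check bits at power-of-2 positions; each covers positions sharing its bit."""
    m = len(message_bits)
    c = 1
    while 2**c < c + m + 1:      # Eq. (2.21)
        c += 1
    n = m + c
    word = [0] * (n + 1)         # index 0 unused so that indices equal bit positions
    data = iter(message_bits)
    for pos in range(1, n + 1):
        if pos & (pos - 1):      # not a power of 2: a message position
            word[pos] = next(data)
    for k in range(c):
        p = 1 << k               # position of check bit number k + 1
        for pos in range(1, n + 1):
            if pos != p and pos & p:
                word[p] ^= word[pos]
    return word[1:]

print(hamming_encode([1, 0, 1, 0]))   # [1, 0, 1, 1, 0, 1, 0]
```

For the 4-bit message 1010 the result is the code word 1011010 obtained above;
passing 8 message bits yields a 12-bit word.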

To check the transmitted word, we recalculate the check bits using Eqs.
(2.22a-c) and obtain $c'_1$, $c'_2$, and $c'_3$. The old and the new parity
check bits are compared, and any disagreement indicates an error. Depending on
which check bits disagree, we can determine which message bit is in error.
Hamming devised an ingenious way to make this check, which we illustrate by
example.

Suppose that bit 3 of the message we have been discussing changes from
a "1" to a "0" because of a noise pulse. Our code word then becomes
$c_1c_2b_1c_3b_2b_3b_4 = 1011000$. Then, application of Eqs. (2.22a-c) yields
$c'_1$, $c'_2$, and $c'_3$ = 110 for the new check bits. Disagreement of the
check bits in the message with the newly calculated check bits indicates that
an error has been detected. To locate the error, we calculate error-address
bits, $e_3e_2e_1$, as follows:

$$e_1 = c_1 \oplus c'_1 = 1 \oplus 1 = 0 \tag{2.24a}$$

$$e_2 = c_2 \oplus c'_2 = 0 \oplus 1 = 1 \tag{2.24b}$$

$$e_3 = c_3 \oplus c'_3 = 1 \oplus 0 = 1 \tag{2.24c}$$

The binary address of the error bit is given by $e_3e_2e_1$, which in our
example is 110, or 6. Thus we have detected correctly that the sixth position,
$b_3$, is in error. If the address of the error bit is 000, it indicates that
no error has occurred; thus calculation of $e_3e_2e_1$ can serve as our means
of error detection and correction. To correct a bit that is in error once we
know its location, we replace the bit with its complement.
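
The checking procedure just described can be sketched in Python as follows. The
bit ordering follows Table 2.6, and the function names are assumptions made for
this illustration.

```python
def syndrome(word):
    """Error address e3 e2 e1 as an integer, for c1 c2 b1 c3 b2 b3 b4 (Eqs. 2.24a-c)."""
    c1, c2, b1, c3, b2, b3, b4 = word
    e1 = c1 ^ (b1 ^ b2 ^ b4)     # received c1 against recalculated c1'
    e2 = c2 ^ (b1 ^ b3 ^ b4)
    e3 = c3 ^ (b2 ^ b3 ^ b4)
    return (e3 << 2) | (e2 << 1) | e1

def correct(word):
    """Complement the addressed bit if the syndrome is nonzero."""
    fixed = list(word)
    address = syndrome(word)
    if address:
        fixed[address - 1] ^= 1
    return address, fixed

print(correct([1, 0, 1, 1, 0, 0, 0]))   # (6, [1, 0, 1, 1, 0, 1, 0])
```

The same routine locates an error in a check bit just as it does in a message
bit, since every one of the 7 positions has its own nonzero address.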

The generation and checking operations described above can be derived in
terms of a parity code matrix (essentially the last three rows of Table 2.9), a
column vector that is the coded word, and a row vector called the syndrome,
which is the $e_3e_2e_1$ that we called the binary address of the error bit. If
no errors occur, the syndrome is zero. If a single error occurs, the syndrome
gives the correct address of the erroneous bit. If a double error occurs, the
syndrome is nonzero, indicating an error; however, the address of the erroneous
bit is incorrect. In the case of a triple error that converts one legal code
word into another, the syndrome is zero and the errors are not detected. For a
further discussion of the matrix representation of Hamming codes, the reader is
referred to Siewiorek [1992].
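
A small Python sketch of the matrix view follows. The matrix H is taken from
the three check-bit rows of Table 2.9, the syndrome is the modulo-2 product of
H with the received word, and the helper names are assumptions made for this
illustration.

```python
# Parity-check matrix: rows are the c1, c2, c3 rows of Table 2.9.
H = [
    [1, 0, 1, 0, 1, 0, 1],
    [0, 1, 1, 0, 0, 1, 1],
    [0, 0, 0, 1, 1, 1, 1],
]

def syndrome(word):
    """Return (e1, e2, e3) = H * word (mod 2)."""
    return tuple(sum(h * x for h, x in zip(row, word)) % 2 for row in H)

def flip(word, *positions):
    """Copy of word with the given 1-based bit positions complemented."""
    out = list(word)
    for p in positions:
        out[p - 1] ^= 1
    return out

code_word = [1, 0, 1, 1, 0, 1, 0]
print(syndrome(code_word))                  # (0, 0, 0): no error
print(syndrome(flip(code_word, 6)))         # (0, 1, 1): address e3e2e1 = 110 = 6
print(syndrome(flip(code_word, 1, 2)))      # (1, 1, 0): points at x3, which is wrong
print(syndrome(flip(code_word, 1, 2, 3)))   # (0, 0, 0): this triple error is missed
```

Not every triple error gives a zero syndrome; flipping positions 1, 2, and 4,
for instance, produces the address 111 and would be "corrected" at position 7.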


2.4.4 The Hamming SECDED Code 

The SECDED code is a distance 4 code that can be viewed as a distance 3
code with one additional check bit. It can also be a triple error-detecting code
(TED). It is easy to design such a code by first designing a SECSED code and
then adding an appended check bit, which is a parity bit over all the other
message and check bits. An even-parity code is traditionally used; however, if
the digital electronics generating the code word have a failure mode in which
the chip is burned out and all bits are 0, it will not be detected by an
even-parity scheme. Thus odd parity is preferred for such a case. We expand on
the (7, 4) SECSED example of the previous section and affix an additional check
bit ($c_4$) and an additional syndrome bit ($e_4$) to obtain a SECDED code.

$$c_4 = c_1 \oplus c_2 \oplus b_1 \oplus c_3 \oplus b_2 \oplus b_3 \oplus b_4 \tag{2.25}$$

$$e_4 = c_4 \oplus c'_4 \tag{2.26}$$

The new coded word is $c_1c_2b_1c_3b_2b_3b_4c_4$. The syndrome is interpreted as
given in Table 2.10.

TABLE 2.10 Interpretation of Syndrome for a Hamming (8, 4)
SECDED Code

e1    e2    e3    e4    Interpretation
0     0     0     0     No errors
a1    a2    a3    1     One error, a1a2a3
a1    a2    a3    0     Two errors, a1a2a3 not 000
0     0     0     1     Three errors
0     0     0     0     Four errors

Table 2.8 can be modified for a SECDED code by adding 1 to the code
length column and 1 to the check bits column. The overhead values become
63%, 38%, 22%, 15%, and 13%.
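
The following Python sketch shows one way to realize the (8, 4) scheme, using
even overall parity and reading the syndrome roughly along the lines of
Table 2.10. The function names and the wording of the returned messages are
assumptions made for illustration. Note one detail visible in the code: a
single error in $c_4$ itself also produces the address 000 with $e_4$ = 1, so
that pattern is reported here as position 8.

```python
def encode_secded(b1, b2, b3, b4):
    """(8, 4) word c1 c2 b1 c3 b2 b3 b4 c4 with even overall parity, Eq. (2.25)."""
    c1, c2, c3 = b1 ^ b2 ^ b4, b1 ^ b3 ^ b4, b2 ^ b3 ^ b4
    word = [c1, c2, b1, c3, b2, b3, b4]
    return word + [sum(word) % 2]

def classify(word):
    """Interpret the syndrome along the lines of Table 2.10."""
    c1, c2, b1, c3, b2, b3, b4, c4 = word
    e1 = c1 ^ b1 ^ b2 ^ b4
    e2 = c2 ^ b1 ^ b3 ^ b4
    e3 = c3 ^ b2 ^ b3 ^ b4
    e4 = c4 ^ (sum(word[:7]) % 2)            # Eq. (2.26)
    address = (e3 << 2) | (e2 << 1) | e1
    if address == 0 and e4 == 0:
        return "no error detected"
    if e4 == 1:
        return f"odd number of errors; if single, at position {address or 8}"
    return "double error detected, not correctable"

word = encode_secded(1, 0, 1, 0)
print(word)            # [1, 0, 1, 1, 0, 1, 0, 0]
print(classify(word))  # no error detected
```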


2.4.5 Reduction in Undetected Errors 

The probability of an undetected error for a SECSED code depends on the
error-correction philosophy. Either a nonzero syndrome can be viewed as a
single error (and the error-correction circuitry is enabled) or it can be viewed
as detection of a double error. Since the next section will treat uncorrected error
probabilities, we assume in this section that the nonzero syndrome condition
for a SECSED code means that we are detecting 1 or 2 errors. (Some people
would call this simply a distance 3 double error-detecting, or DED, code.) In
such a case, the error detection fails if 3 or more errors occur. We discuss these
probability computations by using the example of a code for a 1-byte message,
where m = 8 and c = 4 (see Table 2.8). If we assume that the dominant term in
this computation is the probability of 3 errors, then we can use Eq. (2.1) with
n = 12 and write

$$P'_{ue} = B(3:12) = 220q^{3}(1-q)^{9} \tag{2.27}$$

Following simplifications similar to those used to derive Eq. (2.7), the
undetected error ratio becomes

$$P_{ue}/P'_{ue} = 2(1+9q)/55q^{2} \tag{2.28}$$

This ratio is evaluated in Table 2.11.

TABLE 2.11 Evaluation of the Reduction in Undetected
Errors for a Hamming SECSED Code: Eq. (2.28)

Bit Error Probability, q      Improvement Ratio: P_ue/P'_ue
10^-4                         3.640 x 10^6
10^-5                         3.637 x 10^8
10^-6                         3.636 x 10^10
10^-7                         3.636 x 10^12
10^-8                         3.636 x 10^14
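
As with the parity-bit case, the simplified ratio of Eq. (2.28) can be compared
against the unsimplified quotient of Eq. (2.5) and Eq. (2.27). The Python
sketch below does so; the function names are assumptions made for this
illustration.

```python
import math

def ratio_exact(q):
    """[1 - (1 - q)^8] / [220 q^3 (1 - q)^9], i.e. Eq. (2.5) over Eq. (2.27)."""
    p_ue = -math.expm1(8 * math.log1p(-q))
    return p_ue / (math.comb(12, 3) * q**3 * (1 - q)**9)

def ratio_approx(q):
    """Eq. (2.28)."""
    return 2 * (1 + 9 * q) / (55 * q**2)

for exp in range(4, 9):
    q = 10.0 ** -exp
    print(f"q = 1e-{exp}: exact = {ratio_exact(q):.4e}, "
          f"Eq. (2.28) = {ratio_approx(q):.4e}")
```

The ratio grows as 1/q^2 rather than 1/q, which is why these values are so much
larger than the parity-bit values for the same q.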

2.4.6 Effect of Coder-Decoder Failures 

Clearly, the error improvement ratios in Table 2.11 are much larger than those
in Table 2.3. We now must include the probability of the generator/checker
circuitry failing. This should be a more significant effect than in the case of
the parity-bit code for two reasons. First, the undetected error probabilities are
much smaller with the SECSED code, and second, the generator/checker will
be more complex. A practical circuit for checking a (7, 4) SECSED code is
given in Wakerly [p. 298, 1990] and is reproduced in Fig. 2.7. For the reader
who is not experienced in digital circuitry, some explanation is in order. The
three 74LS280 ICs ($U_1$, $U_2$, and $U_3$) are similar to the SN74180 shown in
Fig. 2.4. Substituting Eq. (2.22a) into Eq. (2.24a) shows that the syndrome bit
$e_1$ is dependent on the $\oplus$ of $c_1$, $b_1$, $b_2$, and $b_4$, and from
Table 2.6 we see that these are bit positions $x_1$, $x_3$, $x_5$, and $x_7$,
which correspond to the inputs to $U_1$. Similarly, $U_2$ and $U_3$ compute
$e_2$ and $e_3$. The decoder $U_4$ (see Appendix C6.3) activates one of its 8
outputs, which is the address of the error bit. The 8 output gates ($U_5$ and
$U_6$) are exclusive or gates (see Appendix C; only 7 are used). The output of
$U_4$ selects the erroneous bit from the bus DU(1-7), complements it
(performing a correction), and passes through the other 6 bits unchanged.
Actually the outputs DU(1-7) are all complements of the desired values;
however, this is simply corrected by a group of inverters at the output or
inversion of the next stage of digital logic. For a check-bit generator, we
can use three 74LS280 chips to generate $c_1$, $c_2$, and $c_3$.

We can compute the reliability of the generator/checker circuitry by again
using the IC failure rate model of Section B3.3, $\lambda_b = 0.004\sqrt{g}$.
We assume that any failure in the IC causes system failure, so the reliability
diagram is a series structure and the failure rates add. The computation is
detailed in Table 2.12. (See also Fig. 2.7.)

Figure 2.7 Error-correcting circuit for a Hamming (7, 4) SECSED code [Reprinted
by permission of Pearson Education, Inc., Upper Saddle River, NJ 07458; from
Wakerly, 2000, p. 298].

Thus the failure rate for the coder plus decoder is
$\lambda = 13.58 \times 10^{-8}$ failures per hour, which is about four times
as large as that for the parity-bit case ($2 \times 1.67 \times 10^{-8}$)
that was calculated previously.

We now incorporate the possibility of generator/checker failure and how it
affects the error-correction performance in the same manner as we did with the
parity-bit code in Eqs. (2.8)-(2.11). From Table 2.8 we see that a 1-byte (8-bit)
message requires 4 check bits; thus the SECSED code is (12, 8). The example
developed in Table 2.12 and Fig. 2.7 was for a (7, 4) code, but we can easily
modify these results for the (12, 8) code we have chosen to discuss. First, let
us consider the code generator. The 74LS280 chips are designed to generate
parity check bits for up to an 8-bit word, so they still suffice; however, we now
need to generate 4 check bits, so a total of 4 chips will be required. In the
case of the checker (see Fig. 2.7), we will also require four 74LS280 chips to
generate the syndrome bits. Instead of a 3-to-8 decoder we will need a 4-to-16
decoder for the next stage, which can be implemented by using two 74LS138 chips
and the appropriate connections at the enable inputs (G1, G2A, and G2B), as
explained in Appendix C6.3. The output stage composed of 74LS86 chips will
not be required if we are only considering error detection, since the nonerror
output is sufficient for this. Thus we can modify Table 2.12 to compute the
failure rate that is shown in Table 2.13. Note that one could argue that since we
are only computing the error-detection probabilities, the decoders and output
correction EXOR gates are not needed, and only an OR gate with the syndrome
inputs is needed to detect a 0000 syndrome that indicates no errors.

Using the information in Table 2.13 and Eq. (2.27), we obtain an expression
similar to Eq. (2.10), as follows:

$$P'_{ue} = e^{-\lambda t}\, 220q^{3}(1-q)^{9} + (1 - e^{-\lambda t}) \tag{2.29}$$

where $\lambda$ is $19.50 \times 10^{-8}$ failures per hour and $t$ is
$12/(3{,}600B)$.

We formulate the improvement ratio by dividing Eq. (2.12) by Eq. (2.29);
the ratio is given in Table 2.14 and is plotted in Fig. 2.8. The data presented
in Table 2.11 is also plotted in Fig. 2.8 and represents the line labeled
$B \to \infty$, which represents the case for a nonfailing generator/checker.
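
Equation (2.29) can be explored numerically in the same way as Eq. (2.10). The
Python sketch below uses the failure rate quoted above as a default parameter;
the function names are assumptions made for this illustration. Because the
result is very sensitive to the value of the failure rate when q is small, no
attempt is made here to reproduce the tabulated entries digit for digit.

```python
import math

def p_ue_unchecked(q):
    """Eq. (2.12)."""
    return -math.expm1(8 * math.log1p(-q))

def p_ue_secsed(q, bit_rate, lam=19.50e-8):
    """Eq. (2.29); lam is the coder-plus-decoder failure rate per hour."""
    t = 12 / (3600 * bit_rate)                 # hours to send 12 bits
    p_chip_fail = -math.expm1(-lam * t)        # 1 - exp(-lambda*t)
    p_three_errors = 220 * q**3 * (1 - q)**9   # Eq. (2.27)
    return (1 - p_chip_fail) * p_three_errors + p_chip_fail

def improvement(q, bit_rate, lam=19.50e-8):
    """Ratio of Eq. (2.12) to Eq. (2.29)."""
    return p_ue_unchecked(q) / p_ue_secsed(q, bit_rate, lam)

for q in (1e-4, 1e-6, 1e-8):
    print(q, [f"{improvement(q, b):.2e}" for b in (300, 1200, 9600, 56000)])
```

Setting `lam=0.0` removes the chip-failure term and recovers the nonfailing
case of Table 2.11, which is a convenient sanity check on the implementation.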

2.4.7 How Coder-Decoder Failures Affect SECSED Codes 

Because the Hamming SECSED code results in a lower value for undetected
errors than the parity-bit code, the effect of chip failures is even more
pronounced. Of course the coding is still a big improvement, but not as much as
one would predict. In fact, by comparing Figs. 2.8 and 2.5 we see that for B
= 300, the parity-bit scheme is superior to the SECSED scheme for values of
q less than about $2 \times 10^{-7}$; for B = 1,200, the parity-bit scheme is
superior to the SECSED scheme for values of q less than about $10^{-7}$. The
general conclusion is that for more complex error detection schemes, one should
evaluate the effects of generator/checker failures, since these may be of
considerable importance for small values of q. (Chip-specific failure rates may
be required.)

More generally, we should compute whether generator/checker failures
significantly affect the code performance for the given values of q and B. If
such failures are significant, we can consider the following alternatives:

1. Consider a simpler coding scheme if q is very small and B is low.

2. Consider other coding schemes if they use simpler generator/checker
circuitry.

3. Use other digital logic designs that utilize fewer but larger chips. Since
the failure rate is proportional to $\sqrt{g}$, larger-scale integration improves
reliability.

4. Seek to lower IC failure rates via improved derating, burn-in, use of high
reliability ICs, and so forth.

5. Seek fault-tolerant or redundant schemes for code generator and code
checker circuitry.

TABLE 2.14 The Reduction in Undetected Errors from a Hamming (12, 8) DED
Code Including the Effect of Coder-Decoder Failures

                 Improvement Ratio: P_ue/P'_ue for Several Transmission Rates
Bit Error
Probability, q   300 Bits/Sec    1,200 Bits/Sec   9,600 Bits/Sec   56,000 Bits/Sec
10^-4            3.608 x 10^6    3.629 x 10^6     3.637 x 10^6     3.638 x 10^6
10^-5            3.88 x 10^7     1.176 x 10^8     2.883 x 10^8     3.480 x 10^8
10^-6            4.34 x 10^6     1.738 x 10^7     1.386 x 10^8     7.939 x 10^8
10^-7            4.35 x 10^5     1.739 x 10^6     1.391 x 10^7     8.116 x 10^7
10^-8            4.35 x 10^4     1.739 x 10^5     1.391 x 10^6     8.116 x 10^6

Figure 2.8 Improvement ratio of undetected error probability from a SECSED code,
including the possibility of coder-decoder failure. B is the transmission rate in bits per
second.

2.5 ERROR-DETECTION AND RETRANSMISSION CODES

2.5.1 Introduction 

We have discussed both error detection and correction in the previous sections
of this chapter. However, performance metrics (the probabilities of undetected
errors) have been discussed only for error detection. In this section, we
introduce metrics for evaluating the error-correction performance of various codes.
In discussing the applications for parity and Hamming codes, we have focused
on information transmission as a typical application. Clearly, the
implementations and metrics we have developed apply equally well to memory
protection schemes, cache checking, bus-transmission checks, and so forth. Thus,
when we again use a data-transmission application to discuss error correction,
the results will also apply to the other applications.

The Hamming error-correcting codes provide a direct means of error
correction; however, if our transmission channel allows communication in both
directions (bidirectional), there is another possibility. If we detect an error, we
can send control signals back to the source to ask for retransmission of the
erroneous byte, word, or code block. In general, the appropriate measure of
error correction is the reliability (probability of no error).

2.5.2 Reliability of a SECSED Code 

To discuss the reliability of transmission, we again focus on 1 transmitted byte
and compute the reliability with and without error correction. The reliability
of a single transmitted byte without any error correction is just the probability
of no errors occurring, which was calculated as the second term in Eq. (2.5).

R = (1 - q)^{8}    (2.30)

In the case of a SECSED code (12, 8), single errors are corrected; thus the
reliability is given by

R' = P(no errors + 1 error)    (2.31)

and since these are mutually exclusive events,

R' = P(no errors) + P(1 error)    (2.32)

the binomial distribution yields

R' = (1 - q)^{12} + 12q(1 - q)^{11} = (1 - q)^{11}(1 + 11q)    (2.33)

Clearly, R' > R; however, for small values of q, both are very close to 1,
and it is easier to compare the unreliability U = 1 - R. Thus a measure of the
improvement of a SECSED code is given by the ratio of unreliabilities

U/U' = (1 - R)/(1 - R') = [1 - (1 - q)^{8}]/[1 - (1 - q)^{11}(1 + 11q)]    (2.34)

and approximating this for small q yields

U/U' = 8/(121q)    (2.35)

which is evaluated for typical values of q in Table 2.15.

TABLE 2.15 Evaluation of the Reduction in Unreliability for
a Hamming SECSED Code: Eq. (2.35)

Bit Error Probability, q    Improvement Ratio: U/U' = (1 - R)/(1 - R')

10^{-4}                     6.61 x 10^2
10^{-5}                     6.61 x 10^3
10^{-6}                     6.61 x 10^4
10^{-7}                     6.61 x 10^5
10^{-8}                     6.61 x 10^6

The foregoing evaluations neglected the probability of IC generator and 
checker failure. However, the analysis can be broadened to include these effects 
as was done in the preceding sections. 
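
As a cross-check on the small-q approximation, the ratio in Eq. (2.34) can also
be evaluated directly. The following is a minimal sketch and not part of the
original analysis: the function names are hypothetical, and exact rational
arithmetic (Python's fractions module) is assumed so that the subtraction
1 - R' does not lose precision when q is as small as 10^{-8}. Values obtained
this way may differ from a hand-simplified expression, so it is worth comparing
the two before relying on either.

```python
from fractions import Fraction


def unreliability_uncoded(q):
    """U = 1 - R for 8 unprotected bits, with R from Eq. (2.30)."""
    return 1 - (1 - q) ** 8


def unreliability_secsed(q):
    """U' = 1 - R' for the (12, 8) SECSED code, with R' from Eq. (2.33)."""
    return 1 - (1 - q) ** 11 * (1 + 11 * q)


def improvement_ratio(q):
    """Exact right-hand side of Eq. (2.34); q should be a Fraction."""
    return unreliability_uncoded(q) / unreliability_secsed(q)


if __name__ == "__main__":
    for exponent in range(4, 9):
        q = Fraction(1, 10 ** exponent)  # bit error probability 10^-exponent
        print(f"q = 1e-{exponent}: ratio = {float(improvement_ratio(q)):.4g}")
```

Passing q as a Fraction keeps every intermediate result exact; converting to
float happens only when the ratio is printed.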

2.5.3 Reliability of a Retransmitted Code 

If it is possible to retransmit a code block after an error has been detected, one 
can improve the reliability of the transmission. In such a case, the reliability 
expression becomes 

R' = P(no error + detected error and no error on retransmission)    (2.36)

and since these events are mutually exclusive, and the two transmissions are
independent,

R' = P(no error) + P(detected error) \times P(no error on retransmission)    (2.37)

Since the error probabilities on initial transmission and on retransmission
are the same, we obtain

R' = P(no error)[1 + P(detected error)]    (2.38)

For the case of a parity-bit code, we transmit 9 bits; the probability of
detecting an error is approximately the probability of 1 error. Substitution in Eq.
(2.38) yields

R' = (1 - q)^{9}[1 + 9q(1 - q)^{8}]    (2.39)

Comparing the ratio of unreliabilities yields

U/U' = (1 - R)/(1 - R') = [1 - (1 - q)^{8}]/[1 - (1 - q)^{9}[1 + 9q(1 - q)^{8}]]    (2.40)

and simplification for small q yields

U/U' = 8q/[9q^{2} - 828q^{3}]    (2.41)

Similarly, we can use a Hamming distance 3 code (12, 8) to detect up to
2 errors and retransmit. In this case, the probability of detecting an error is
approximately the probability of 1 or 2 errors. Substitution in Eq. (2.38) yields

R' = (1 - q)^{12}[1 + 12q(1 - q)^{11} + 66q^{2}(1 - q)^{10}]    (2.42)

and the unreliability ratio becomes

U/U' = [1 - (1 - q)^{8}]/[1 - (1 - q)^{12}[1 + 12q(1 - q)^{11} + 66q^{2}(1 - q)^{10}]]    (2.43)

and simplification for small q yields

U/U' = 8q/[78q^{2} - 66q^{3}]    (2.44)

Equations (2.41) and (2.44) are evaluated in Table 2.16 for typical values
of q. Comparison of Tables 2.15 and 2.16 shows that both retransmit schemes
are superior to the error correction of a SECSED code, and that the
parity-bit retransmit scheme is the best. However, retransmit has at least a 100%
overhead penalty, and Table 2.8 shows typical SECSED overheads of 11-50%.


TABLE 2.16 Evaluation of the Improvement in Reliability by Code
Retransmission for Parity and Hamming d = 3 Code

Bit Error         Parity-Bit Retransmission    Hamming d = 3 Retransmission
Probability, q    U/U', Eq. (2.41)             U/U', Eq. (2.44)

10^{-4}           8.97 x 10^3                  1.026 x 10^3
10^{-5}           8.90 x 10^4                  1.026 x 10^4
10^{-6}           8.89 x 10^5                  1.026 x 10^5
10^{-7}           8.89 x 10^6                  1.026 x 10^6
10^{-8}           8.89 x 10^7                  1.026 x 10^7

The foregoing evaluations neglected the probability of IC generator and
checker failure as well as the circuitry involved in controlling retransmission.
However, the analysis can be broadened to include these effects, and a more
detailed comparison can be made.
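
Equation (2.38) can be evaluated exactly for any code that detects up to a
given number of bit errors and then retransmits once. The sketch below is
illustrative only: the function and parameter names (retransmit_reliability,
max_detected, and so on) are assumptions, and, as in the text, generator/checker
failures and the retransmission control circuitry are ignored. Exact values
computed this way need not coincide with hand-simplified small-q expressions,
so the two should be compared.

```python
from fractions import Fraction
from math import comb


def p_exactly(n, i, q):
    """Binomial probability of exactly i bit errors among n transmitted bits."""
    return comb(n, i) * q ** i * (1 - q) ** (n - i)


def retransmit_reliability(n, max_detected, q):
    """R' of Eq. (2.38): no error, or a detected error then a clean retransmission."""
    p_none = p_exactly(n, 0, q)
    # Errors of 1..max_detected bits are assumed to be detected.
    p_detected = sum(p_exactly(n, i, q) for i in range(1, max_detected + 1))
    return p_none * (1 + p_detected)


def unreliability_ratio(n, max_detected, q, message_bits=8):
    """(1 - R)/(1 - R'), with R the reliability of the unprotected message bits."""
    uncoded = 1 - (1 - q) ** message_bits
    return uncoded / (1 - retransmit_reliability(n, max_detected, q))


if __name__ == "__main__":
    q = Fraction(1, 10 ** 4)
    print("parity (9 bits, detect 1):   ", float(unreliability_ratio(9, 1, q)))
    print("Hamming d=3 (12 bits, det 2):", float(unreliability_ratio(12, 2, q)))
```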


2.6 BURST ERROR-CORRECTION CODES 
2.6.1 Introduction 

The codes previously discussed have all been based on the assumption that the
probability that bit b_i is corrupted by an error is largely independent of whether
bit b_{i-1} is correct or is in error. Furthermore, the probability of a single bit
error, q, is relatively small; thus the probability of more than one error in a
word is quite small. In the case of a burst error, the probability that bit b_i is
corrupted by an error is much larger if bit b_{i-1} is incorrect than if bit b_{i-1} is
correct. In other words, the errors commonly come in bursts rather than singly.
One class of applications subject to burst errors is rotational magnetic
and optical storage devices (e.g., music CDs, CD-ROMs, and hard and floppy
disk drives). Magnetic tape used for pictures, sound, or data is also affected
by burst errors.

Examples of the patterns of typical burst errors are given in the four 12-bit
messages shown in the forthcoming equations. The common notation
is used where b represents a correct message bit and x represents an erroneous
message bit. (For the purpose of identification, assume that the bits are
numbered 1-12 from left to right.)

m_1 = bbbxxbxbbbbb    (2.45a)

m_2 = bxbxxbbbbbbb    (2.45b)

m_3 = bbbbxbxbbbbb    (2.45c)

m_4 = bxxbbbbbbbbb    (2.45d)


Messages 1 and 2 each have 3 errors that extend over 4 bits (e.g., in m_1
the error bits are in positions 4, 5, and 7); we would refer to them as bursts
of length 4. In message 3, the burst is of length 3; in message 4, the burst is
of length 2. In general, we call the burst length t. The burst length is really a
matter of definition; for example, one could interpret messages 1 and 2 as 2
bursts, one of length 1 and one of length 2. In practice, this causes no
confusion, for t is a parameter of a burst code and is fixed in the initial design of
the code. Thus if t is chosen as length 4, all 4 of the messages would have 1
burst. If t is chosen as length 3, messages 1 and 2 would have two bursts, and
messages 3 and 4 would have 1 burst.
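
The burst counts quoted above can be reproduced with a simple left-to-right
scan. This is a sketch under one assumed reading of the definition: a burst
starts at the first error not yet covered and is taken to cover the next t bit
positions. The function name count_bursts is hypothetical.

```python
def count_bursts(pattern, t):
    """Count bursts of length at most t in a string of 'b' (correct) and 'x' (error)."""
    bursts = 0
    i = 0
    while i < len(pattern):
        if pattern[i] == "x":
            bursts += 1   # a new burst starts at this erroneous bit
            i += t        # the burst is assumed to cover the next t positions
        else:
            i += 1
    return bursts


# The four messages of Eqs. (2.45a-d).
MESSAGES = {
    "m1": "bbbxxbxbbbbb",
    "m2": "bxbxxbbbbbbb",
    "m3": "bbbbxbxbbbbb",
    "m4": "bxxbbbbbbbbb",
}

if __name__ == "__main__":
    for t in (4, 3):
        print(t, {name: count_bursts(bits, t) for name, bits in MESSAGES.items()})
```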

Most burst error codes are more complex than the Hamming codes that
were just discussed; thus the remainder of this chapter will present a succinct
introduction to the basis of such codes and will briefly introduce one of the
most popular burst codes: the Reed-Solomon code [Golomb, 1986].

2.6.2 Error Detection 

We begin by giving an example of a burst error-detection code [Arazi, 1988].
Consider a 12-bit-long code word (also called a code block or code vector, V),
which includes both message and check bits as follows:

V = (x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_{10} x_{11} x_{12})    (2.46)

Let us choose to deal with bursts of length t = 4. Equations for calculating the
check bits in terms of the message bits can be developed by writing a set of
equations in which the bits are separated by t positions. Thus for t = 4, each
equation contains every fourth bit.

x_1 \oplus x_5 \oplus x_9 = 0    (2.47a)

x_2 \oplus x_6 \oplus x_{10} = 0    (2.47b)

x_3 \oplus x_7 \oplus x_{11} = 0    (2.47c)

x_4 \oplus x_8 \oplus x_{12} = 0    (2.47d)


Each bit appears in only one equation. Assume there is either 0 or only 1
burst in the code vector (multiple bursts in a single word are excluded). Thus
each time there is 1 erroneous bit, one of the four equations will equal 1 rather
than 0, indicating a single error. To illustrate this, suppose x_2 is an error bit.
Since we are assuming a burst length of 4 and at most 1 burst per code vector,
the only other possible erroneous bits are x_3, x_4, and x_5. (At this point, we don't
know if 0, 1, 2, or 3 errors occur in bits 3-5.) Examining Eq. (2.47b), we see
that it is not possible for x_6 or x_{10} to be erroneous bits, so it is not possible
for 2 errors to cancel out in evaluating Eq. (2.47b). In fact, if we analyze the
set of Eqs. (2.47a-d), we see that the number of nonzero equations in the set
is equal to the number of bit errors in the burst.

Since there are 4 check equations, we need 4 check bits; any set of 4 bits
in the vector can be chosen as check bits, provided that 1 bit is chosen from
each equation (2.47a-d). For clarity, it probably makes sense to choose the 4
check bits as the first or last 4 bits in the vector; such a choice in any type of
code is referred to as a systematic code. Suppose we choose the first 4 bits.
We then obtain a (12, 8) systematic burst code of length 4, where c_i stands for
a check bit and b_i a message bit.

V = (c_1 c_2 c_3 c_4 b_1 b_2 b_3 b_4 b_5 b_6 b_7 b_8)    (2.48)

A moment's reflection shows that we have now maneuvered Eqs. (2.47a-d)
so that with c's and b's substituted for the x's, we obtain

c_1 \oplus b_1 \oplus b_5 = 0    (2.49a)

c_2 \oplus b_2 \oplus b_6 = 0    (2.49b)

c_3 \oplus b_3 \oplus b_7 = 0    (2.49c)

c_4 \oplus b_4 \oplus b_8 = 0    (2.49d)

which can be used to compute the check bits. These equations are therefore
the basis of the check-bit generator, which can be done with 74180 or 74280
IC chips.
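
In software, the check-bit generator of Eqs. (2.49a-d) reduces to one EXOR per
message bit. The following is a minimal sketch; the function name
encode_burst_detect is hypothetical, and bits are held in a Python list with
b_1 at index 0.

```python
def encode_burst_detect(message_bits, t=4):
    """Prefix t check bits so that every group of t-spaced bits EXORs to 0."""
    if len(message_bits) % t != 0:
        raise ValueError("pad the message with dummy bits to a multiple of t")
    check = [0] * t
    for i, bit in enumerate(message_bits):
        # Message bit b_(i+1) belongs to the equation of check bit c_((i mod t)+1).
        check[i % t] ^= bit
    return check + list(message_bits)


if __name__ == "__main__":
    # c_1 = b_1 EXOR b_5, c_2 = b_2 EXOR b_6, and so on, as in Eqs. (2.49a-d).
    print(encode_burst_detect([1, 0, 1, 1, 0, 1, 0, 0]))
```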

The same set of equations forms the basis of the error-checking circuitry.
Based on the fact that the number of nonzero equations in the set of Eqs.
(2.47a-d) is equal to the number of bit errors in the burst, we can modify Eqs.
(2.47a-d) so that they explicitly yield bits of a syndrome vector, e_1 e_2 e_3 e_4.

e_1 = x_1 \oplus x_5 \oplus x_9    (2.50a)

e_2 = x_2 \oplus x_6 \oplus x_{10}    (2.50b)

e_3 = x_3 \oplus x_7 \oplus x_{11}    (2.50c)

e_4 = x_4 \oplus x_8 \oplus x_{12}    (2.50d)

The nonerror condition occurs when all the syndrome bits are 0. In general,
the number of errors detected is the arithmetic sum: e_1 + e_2 + e_3 + e_4. Note that
because we originally chose t = 4 in this design, no more than 4 errors can
be detected. Again, the checker can be done with 74180 or 74280 IC chips.
Alternatively, one can use individual gates. To generate the check bits, 4 EXOR
gates are sufficient; 8 EXOR gates and an output OR gate are sufficient for
error checking (cf. Fig. 2.2). However, if one wishes to determine how many
errors have occurred, the output OR gate in the checker can be replaced by a
few half-adders or full-adders to compute the arithmetic sum: e_1 + e_2 + e_3 + e_4.
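
The checker can be sketched in the same way. The names burst_syndrome and
errors_detected are hypothetical; the error count is valid only under the
text's assumption of at most one burst of length t per code word.

```python
def burst_syndrome(word, t=4):
    """Syndrome bits e_1..e_t of Eqs. (2.50a-d), returned as a list."""
    syndrome = [0] * t
    for i, bit in enumerate(word):
        syndrome[i % t] ^= bit   # bit x_(i+1) feeds equation (i mod t) + 1
    return syndrome


def errors_detected(word, t=4):
    """Arithmetic sum e_1 + ... + e_t, i.e., the number of bit errors in the burst."""
    return sum(burst_syndrome(word, t))


if __name__ == "__main__":
    valid = [1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0]   # satisfies Eqs. (2.47a-d)
    corrupted = list(valid)
    for position in (4, 5, 7):                       # error positions of m_1
        corrupted[position - 1] ^= 1
    print(burst_syndrome(valid), burst_syndrome(corrupted), errors_detected(corrupted))
```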

We can now state some properties of burst codes that were illustrated by the 
above discussion. The reader is referred to the references for proof [Arazi, 1988]. 

Properties of Burst Codes 

1. For a burst length of t, t check bits are needed for error detection. (Note:
this is independent of the message length m.)

2. For m message bits and a burst length of t, the code word length n = m
+ t.

3. There are t check-bit equations:

(a) The first check-bit equation starts with bit 1 and contains all the bits
that are t + 1, 2t + 1, ... kt + 1 (where kt + 1 <= n).

(b) The second check-bit equation starts with bit 2 and contains all the
bits that are t + 2, 2t + 2, ... kt + 2 (where kt + 2 <= n).

...

(t) The t'th check-bit equation starts with bit t and contains all the bits
that are 2t, 3t, ... kt (where kt <= n).

4. The EXOR of all the bits in 3a should = 0 and similarly for properties
3b, ... 3t.

5. The word length n need not be an integer multiple of t, but for
practicality, we assume that it is. If necessary, the word can be padded with
additional dummy bits to achieve this.

6. Generation and checking for a burst error code (as well as other codes)
can be realized by a linear feedback shift register (LFSR). (See Fig. 2.9.)

7. In general, the LFSR has a delay of t x the shift time.

8. The generating and checking for a burst error code can be realized by
an EXOR tree circuit (cf. Fig. 2.2), in which the number of stages is
<= log_2(t) and the delay is <= log_2(t) x the EXOR gate-switching time.

Figure 2.9 Burst error-detection circuitry using an LFSR: (a) encoder; (b) decoder.
[Reprinted by permission of MIT Press, Cambridge, MA 02142; from Arazi, 1988, p.
108.]

These properties are explored further in the problems at the end of this
chapter. To summarize, in this section we have developed the basic equations for
burst error-detection codes and have shown that the check-bit generator and
checker circuitry can be implemented with EXOR trees, parity-bit chips, or
LFSRs. In general, the LFSR implementation requires less hardware, but the
delay time is linear in the burst length t. In the case of EXOR trees, there is
more hardware needed; however, the time delay is less, for it increases
proportionally to the log of t. In either case, for the modest size t = 4 or 5, the
differences in time delay and hardware are not that significant. Both designs
should be attempted, and a choice should be made.

The case of burst error correction is more difficult. It is discussed in the
next section.

2.6.3 Error Correction

We now state some additional properties of burst codes that will lead us to 
an error-correction procedure. In general, these are properties associated with 
a shifting of the error syndrome of a burst code and an ancient theorem of 
number theory related to the mod function. The theorem from number theory 
is called the Chinese Remainder Theorem [Rosen, 1991, p. 134] and was first 
given as a puzzle by the first-century Chinese mathematician Sun-Tsu. It will 
turn out that the method of error correction will depend on first locating a 
region in the code word of t consecutive bits that contains the start of the error 
burst, followed by pinpointing which of these t bits is the start of the burst. The 
methodology is illustrated by applying the principles to the example given in 
Eq. (2.46). For a development of the theory and proofs, the reader is referred 
to Arazi [1988] and Rosen [1991]. 

The error syndrome can be viewed as a cyclic shift of the burst error
pattern. For example, if we assume a single burst and t = 4, then substitution
of an error pattern for x_1 x_2 x_3 x_4 into Eqs. (2.50a-d) will yield a particular
syndrome pattern. To compute what the syndrome would be, we note that if
x_1 x_2 x_3 x_4 = bbbb, all the bits are correct and the syndrome must be 0000.
If bit 1 is in error (either changed from a correct 1 to an erroneous 0 or from a
correct 0 to an erroneous 1), then Eq. (2.50a) will yield a 1 for e_1 (since there
is only 1 burst, bits x_5-x_{12} must be all valid b's). Suppose the error pattern is
x_1 x_2 x_3 x_4 = xbxx; then all other bits in the 12-bit vector are b and substitution
into Eqs. (2.50a-d) yields

e_1 = x \oplus x_5 \oplus x_9 = 1    (2.51a)

e_2 = b \oplus x_6 \oplus x_{10} = 0    (2.51b)

e_3 = x \oplus x_7 \oplus x_{11} = 1    (2.51c)

e_4 = x \oplus x_8 \oplus x_{12} = 1    (2.51d)

which is a syndrome pattern e_1 e_2 e_3 e_4 = 1011. Similarly, error pattern
x_4 x_5 x_6 x_7 = xbxx, where all other bits are b, yields syndrome equations as
follows:

e_1 = x_1 \oplus b \oplus x_9 = 0    (2.52a)

e_2 = x_2 \oplus x \oplus x_{10} = 1    (2.52b)

e_3 = x_3 \oplus x \oplus x_{11} = 1    (2.52c)

e_4 = x \oplus x_8 \oplus x_{12} = 1    (2.52d)

which is a syndrome pattern e_1 e_2 e_3 e_4 = 0111. We can view 0111 as a pattern
that can be transformed into 1011 by cyclic-shifting left (end-around-rotation
left) three times. We will show in the following material that the same
syndrome is obtained by shifting the code vector right four times.

We begin marking a burst error pattern with the first erroneous bit in the
word; thus burst error patterns always start with an x. Since the burst is t bits
long, the syndrome equations (2.50a-d) include bits that differ by t positions.
Therefore, if we shift the burst error pattern in the code vector by t positions
to the right, the burst error pattern generates the same syndrome. There can
be at most u placements of the burst pattern in a code vector that result in
the same syndrome; if the code vector is n bits long, u is the largest integer
such that ut <= n. Without loss of generality, we can always pad the message
bits with dummy bits such that ut = n. We define the mod function x mod y
as the remainder that is obtained when we divide the integer x by the integer
y. Thus, if ut = n, we can then say that n mod u = 0. These relationships will
soon be used to devise an algorithm for burst error correction.

The location of the start of the burst error pattern in a word is related to the 
amount of shift (end-around and cyclic) of the pattern that is observed in the 
syndrome. We can illustrate this relationship by using the burst pattern xbxx 
as an example, where xbxx is denoted by 1011: meaning incorrect, correct, 
incorrect, incorrect. In Table 2.17, we illustrate the relationship between the 
start of the error burst and the rotational shift (end-around shift) in the detected 
error syndrome. We begin by renumbering the code vector, Eq. (2.46), so it 
starts with bit 0: 


V = (x_0 x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_{10} x_{11})    (2.53)

A study of Table 2.17 shows that the number of syndrome shifts is related
to the bit number by (bit number) mod 4. For example, if the burst starts with
bit no. 3, we have 3 mod 4 (which is 3), so the syndrome is the error pattern
shifted 3 places to the right. To recover the burst pattern from the syndrome,
we shift 3 places to the left. In the case of a burst starting with bit no. 4,
4 mod 4 is 0, so the syndrome pattern and the burst pattern agree.

Thus, if we know the position in the code word at which the burst starts
(defined as x), and if the burst length is t, then we can obtain the burst pattern by
shifting the syndrome x mod t places to the left. Knowing the starting position
of the burst (x) and the burst pattern, we can correct any erroneous bits. Thus
our task is now to find x.
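
The shift relationship can be checked exhaustively for the 12-bit, t = 4
example. The sketch below is illustrative; the helper names are hypothetical,
bits are numbered from 0 as in Eq. (2.53), and the burst pattern xbxx is
written as 1011.

```python
def rotate_right(bits, k):
    """End-around (cyclic) shift of a list of bits k places to the right."""
    k %= len(bits)
    return bits[len(bits) - k:] + bits[:len(bits) - k]


def syndrome_for_burst(pattern, start, n=12, t=4):
    """Syndrome produced when `pattern` is the only error, starting at bit `start`."""
    errors = [0] * n
    for offset, flag in enumerate(pattern):
        errors[start + offset] = flag
    syndrome = [0] * t
    for i, flag in enumerate(errors):
        syndrome[i % t] ^= flag   # equation (i mod t) collects every t-th bit
    return syndrome


PATTERN = [1, 0, 1, 1]   # xbxx

if __name__ == "__main__":
    for start in range(12 - 4 + 1):
        observed = syndrome_for_burst(PATTERN, start)
        # The syndrome equals the burst pattern rotated right by (start mod t).
        print(start, observed, observed == rotate_right(PATTERN, start % 4))
```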

The procedure for solving for x depends on the Chinese Remainder
Theorem, a previously mentioned mathematical theorem in number theory. This
theorem states that if p and q are relatively prime numbers (meaning their only
common factor is 1), and if 0 <= x <= (pq - 1), then knowledge of x mod p and x
mod q allows us to solve for x. We already have one equation: x mod t; to
generate another equation, we define u = 2t - 1 and calculate x from x mod u [Arazi,
1988]. Note that t and 2t - 1 are relatively prime since if a number divides t,
it also divides 2t but not 2t - 1. Also, we must show that 0 <= x <= (tu - 1);
however, we already showed that tu <= n. Substitution yields 0 <= x <= (n - 1),
which must be true since the latest bit position to start a burst error (x) for a
burst of length t is n - t < n - 1.
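
For the small moduli involved, the two congruences can be solved by direct
search. This is a sketch only; solve_start_position is a hypothetical name, and
the values t = 4, u = 7 in the usage lines are chosen purely for illustration.

```python
from math import gcd


def solve_start_position(r_t, r_u, t):
    """Return the unique x in [0, t*u) with x mod t == r_t and x mod u == r_u."""
    u = 2 * t - 1
    if gcd(t, u) != 1:
        raise ValueError("t and u must be relatively prime")
    for x in range(t * u):
        if x % t == r_t and x % u == r_u:
            return x
    raise ValueError("residues must satisfy 0 <= r_t < t and 0 <= r_u < u")


if __name__ == "__main__":
    # Illustrative values: t = 4 gives u = 7; x = 10 has residues 2 and 3.
    print(solve_start_position(10 % 4, 10 % 7, 4))
```

The Chinese Remainder Theorem guarantees that the search finds exactly one
value, so the first match can be returned.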

The above relationships show that it is possible to solve for the beginning
of the burst error x and the burst error pattern. Given this information, by
simply complementing the incorrect bits, error correction is performed. The
remainder of this section details how we set up equations to calculate the check
bits (generator) and to calculate the burst pattern and location (checker); this
is done by means of an illustrative example. One circuit implementation using
shift registers is discussed as well.

The number of check bits is equal to u + t, and since u = 2t - 1 and n = ut,
the number of message bits is determined. We formulate check bit equations
in a manner analogous to that used in error checking.

The following example illustrates how the two sets of check bits are
generated, how one formulates and solves for x mod u and x mod t to solve for
x, and how the burst error pattern is determined. In our example, we let t = 3
and calculate u = 2t - 1 = 2 x 3 - 1 = 5. In this case, the word length n = u x t
= 5 x 3 = 15. The code vector is given by

V = (x_0 x_1 x_2 x_3 x_4 x_5 x_6 x_7 x_8 x_9 x_{10} x_{11} x_{12} x_{13} x_{14})    (2.54)

The t + u check equations are generated from a set of u equations that form the
auxiliary syndrome. For our example, the u = 5 auxiliary syndrome equations
are:

s_0 = x_0 \oplus x_5 \oplus x_{10}    (2.55a)

s_1 = x_1 \oplus x_6 \oplus x_{11}    (2.55b)

s_2 = x_2 \oplus x_7 \oplus x_{12}    (2.55c)

s_3 = x_3 \oplus x_8 \oplus x_{13}    (2.55d)

s_4 = x_4 \oplus x_9 \oplus x_{14}    (2.55e)

and the set of t = 3 equations that form the syndrome are

e_0 = x_0 \oplus x_3 \oplus x_6 \oplus x_9 \oplus x_{12}    (2.56a)

e_1 = x_1 \oplus x_4 \oplus x_7 \oplus x_{10} \oplus x_{13}    (2.56b)

e_2 = x_2 \oplus x_5 \oplus x_8 \oplus x_{11} \oplus x_{14}    (2.56c)


If we want a systematic code, we can place the 8 check bits at the beginning
or the end of the word. Let us assume that they go at the end (x_7-x_{14}) and that
these check bits c_0-c_7 are calculated from Eqs. (2.55a-e) and (2.56a-c). The
first 7 bits (x_0-x_6) are message bits, and the transmitted word is

V = (b_0 b_1 b_2 b_3 b_4 b_5 b_6 c_0 c_1 c_2 c_3 c_4 c_5 c_6 c_7)    (2.57)

As an example, let us assume that the message bits b_0-b_6 are 1011010.
Substitution of these values in Eqs. (2.55a-e) and (2.56a-c) that must initially
be 0 yields a set of equations that can be solved for the values of c_0-c_7. One can
show by substitution that the values c_0-c_7 = 10000010 satisfy the equations.
(Shortly, we will describe code generation circuitry that can solve for the check
bits in a straightforward manner.) Thus the transmitted word is

V_t = (b_0 b_1 b_2 b_3 b_4 b_5 b_6) = 1011010 for the message part    (2.58a)

V_t = (c_0 c_1 c_2 c_3 c_4 c_5 c_6 c_7) = 10000010 for the check part    (2.58b)

Let us assume that the received word is

V_r = (101101000100010)    (2.59)
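
The substitution check mentioned above is easy to carry out mechanically. In
this sketch the helper names (parse_bits, is_code_word, and so on) are
hypothetical; a word is accepted as a code word when all eight equations
(2.55a-e) and (2.56a-c) evaluate to 0.

```python
T, U = 3, 5   # burst length t and u = 2t - 1 for the worked example


def parse_bits(text):
    """Turn a string such as '1011010' into a list of integer bits."""
    return [int(ch) for ch in text]


def auxiliary_syndrome(word):
    # s_j is the EXOR of bits j, j + 5, and j + 10, as in Eqs. (2.55a-e).
    return [word[j] ^ word[j + U] ^ word[j + 2 * U] for j in range(U)]


def syndrome(word):
    # e_j is the EXOR of bits j, j + 3, ..., j + 12, as in Eqs. (2.56a-c).
    result = []
    for j in range(T):
        bit = 0
        for i in range(j, T * U, T):
            bit ^= word[i]
        result.append(bit)
    return result


def is_code_word(word):
    """True when every auxiliary-syndrome and syndrome bit is 0."""
    return not any(auxiliary_syndrome(word)) and not any(syndrome(word))


if __name__ == "__main__":
    transmitted = parse_bits("1011010" + "10000010")   # Eqs. (2.58a, b)
    received = parse_bits("101101000100010")            # Eq. (2.59)
    print(is_code_word(transmitted), is_code_word(received))   # True False
```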

We now begin the error-recovery procedure by calculating the auxiliary
syndrome by substitution of Eq. (2.59) in Eqs. (2.55a-e), yielding

s_0 = 1 \oplus 1 \oplus 0 = 0    (2.60a)

s_1 = 0 \oplus 0 \oplus 0 = 0    (2.60b)

s_2 = 1 \oplus 0 \oplus 0 = 1    (2.60c)

s_3 = 1 \oplus 0 \oplus 1 = 0    (2.60d)

s_4 = 0 \oplus 1 \oplus 0 = 1    (2.60e)


The fact that the auxiliary syndrome is not all 0's indicates that 1 or more
errors have occurred. In fact, since two equations are nonzero, there are two
errors. Furthermore, it can be shown that the burst error pattern associated with
the auxiliary syndrome must always start with an x and all bits > t must be
valid bits. Thus, the burst error pattern (since t = 3) must be x??bb = 1??00.
This means the auxiliary syndrome pattern should start with a 1 and end in
two 0's. The unique solution is that the auxiliary syndrome pattern 00101 must
be shifted to the left two places, yielding 10100, so that the first bit is 1 and the
last two bits are 0. In addition, we deduce that the real syndrome (and the burst
pattern) is 101. Similarly, Eqs. (2.56a-c) yield

e_0 = 1 \oplus 1 \oplus 0 \oplus 1 \oplus 0 = 1    (2.61a)

e_1 = 0 \oplus 0 \oplus 0 \oplus 0 \oplus 1 = 1    (2.61b)

e_2 = 1 \oplus 1 \oplus 0 \oplus 0 \oplus 0 = 0    (2.61c)

Thus, to transform the computed syndrome 110, found from Eqs. (2.61a-c), into
the burst pattern 101, we must shift it left one place. Based on these shift
results, our two mod equations become

for u: x mod u = x mod 5 = 2    (2.62a)

for t: x mod t = x mod 3 = 1    (2.62b)

We now know the burst pattern 101 and have two equations (2.62a, b) that
can be solved for the start of the burst pattern given by x. Substitution of trial
values into Eq. (2.62a) yields x = 2, which satisfies (2.62a) but not (2.62b). The
next value that satisfies Eq. (2.62a) is x = 7, and since this value also satisfies
Eq. (2.62b), it is a solution. We conclude that the burst error started at position
x = 7 (the eighth bit, since the count starts with 0) and that it was xbx, so the
eighth and tenth bits must be complemented. Thus the received and corrected
versions of the code vector are

V_r = (101101000100010)    (2.63a)

V_c = (101101010000010)    (2.63b)

Note that Eq. (2.63b) agrees with Eqs. (2.58a, b).

Figure 2.10 Basic error decoder for u = 5 and t = 3 burst code based on three shift
registers. (Additional circuitry is needed for a complete decoder.) The input (IN) is a
train of shift pulses. [Reprinted by permission of MIT Press, Cambridge, MA 02142;
from Arazi, 1988, p. 123.]
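
The recovery procedure of the worked example can be sketched in software by
trying each possible start position x and testing whether the two syndromes,
shifted left by x mod u and x mod t, agree on a pattern that begins with an
error. This is equivalent to solving the two congruences by search. The sketch
is illustrative only: the function names are hypothetical, and it is meant to
reproduce this example rather than to be a complete decoder. Burst patterns
that look the same after rotation (an all-ones pattern, for instance) would
need additional handling, since the first matching position is returned.

```python
T, U = 3, 5          # burst length t and u = 2t - 1
N = T * U            # code word length n = 15


def rotate_left(bits, k):
    """End-around shift of a list of bits k places to the left."""
    k %= len(bits)
    return bits[k:] + bits[:k]


def residue_syndromes(word):
    """Return (auxiliary syndrome, syndrome) per Eqs. (2.55a-e) and (2.56a-c)."""
    aux, syn = [0] * U, [0] * T
    for i, bit in enumerate(word):
        aux[i % U] ^= bit
        syn[i % T] ^= bit
    return aux, syn


def correct_single_burst(word):
    """Return (corrected word, burst start x); x is None if no error is seen."""
    aux, syn = residue_syndromes(word)
    if not any(aux) and not any(syn):
        return list(word), None
    for x in range(N - T + 1):
        pattern = rotate_left(syn, x % T)       # candidate burst pattern
        aux_aligned = rotate_left(aux, x % U)   # should read pattern, then 0's
        if pattern[0] == 1 and aux_aligned == pattern + [0] * (U - T):
            corrected = list(word)
            for offset, flag in enumerate(pattern):
                corrected[x + offset] ^= flag   # complement the erroneous bits
            return corrected, x
    raise ValueError("syndromes are not explained by one burst of length <= t")


if __name__ == "__main__":
    received = [int(ch) for ch in "101101000100010"]   # Eq. (2.63a)
    corrected, start = correct_single_burst(received)
    print(start, "".join(str(bit) for bit in corrected))
```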

One practical decoder implementation for the u = 5 and t = 3 code discussed
above is based on three shift registers (R1, R2, and R3) shown in Fig. 2.10.
Such a circuit is said to employ linear feedback shift registers (LFSR).

Initially, R1 is loaded with the received code vector, R2 is loaded with the
auxiliary syndrome calculated from EXOR trees or parity-bit chips that
implement Eqs. (2.60a-e), and R3 is loaded with the syndrome calculated from
EXOR trees or parity-bit chips that implement Eqs. (2.61a-c). Using our
previous example, R1 is loaded with the received word of Eq. (2.59), R2 with
00101, and R3 with 110. R2 and R3 are shifted left until the left 3 bits of R2
agree with R3, and the leftmost bit is a 1. A count of the number of left shifts
yields the start position of the burst error (x), and the contents of R3 is the
burst pattern. Circuitry to complement the appropriate bits results in error
correction. In the circuit shown, when the error pattern is recovered in R3, R1
has the burst error in the left 3 bits of the register. If correction is to be
performed by shifting, the leftmost 3 bits in R1 and R3 can be EXORed and
restored in R1. This would assume that the bits shifted out of R1 go to a storage
register or are circulated back to R1 and, after error detection, the bits in the
repaired word are shifted to their proper position. For more details, see Arazi
[1988].

Figure 2.11 Basic encoder circuit for u = 5 and t = 3 burst code based on three shift
registers. (Additional circuitry is needed for a complete decoder.) The input (IN) is
the information vector (message). [Reprinted by permission of MIT Press, Cambridge,
MA 02142; from Arazi, 1988, p. 125.]


One can also generate the check bits (encoder) by using LFSRs. One such 
circuit for our code example is given in Fig. 2.11. For more details, see Arazi 
[1988]. 

2.7 REED-SOLOMON CODES 

2.7.1 Introduction 

One technique to mitigate burst errors is simply to interleave data so
that a burst does not affect more than a few consecutive data bits at a time. A
more efficient approach is to use codes that are designed to detect and correct 
burst errors. One of the most popular types of error-correcting codes is the 
Reed-Solomon (RS) code. This code is useful for correcting both random and 
burst errors, but it is especially popular in burst error situations and is often 
used with other codes in a convolutional code (see Section 2.8). 

2.7.2 Block Structure 

The RS code is a block-type code and operates on multiple rather than
individual bits. Data is processed in a batch called a block instead of continuously.
Each block is composed of n symbols, each of which has m bits. The block
length n = 2^{m} - 1 symbols. A message is k symbols long, and n - k additional
check symbols are added to allow error correction of up to t error symbols.
Block length and symbol sizes can be adjusted to accommodate a wide range
of message sizes. For an RS code, one can show that

(n - k) = 2t for n - k even    (2.64a)

(n - k) = 2t + 1 for n - k odd    (2.64b)

minimum distance = d_{min} = 2t + 1 symbols    (2.64c)
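
These relations can be collected in a small helper. The sketch below assumes
the even case of Eq. (2.64a), n - k = 2t; the function name rs_parameters and
the small illustrative values m = 4, t = 2 are choices made here and do not
come from the text.

```python
def rs_parameters(m, t):
    """Block parameters of an RS code with m-bit symbols correcting t symbol errors."""
    n = 2 ** m - 1            # block length in symbols
    k = n - 2 * t             # message symbols, assuming n - k = 2t, Eq. (2.64a)
    if k <= 0:
        raise ValueError("t is too large for this symbol size")
    return {
        "n": n,
        "k": k,
        "check_symbols": n - k,
        "d_min": 2 * t + 1,   # Eq. (2.64c)
        "code_rate": k / n,   # fraction of the block that carries message symbols
    }


if __name__ == "__main__":
    # Illustrative small code: 4-bit symbols, 2 correctable symbol errors.
    print(rs_parameters(4, 2))
```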

As a typical example [AHA Applications Note], we will assume n = 255
and m = 8 (a symbol is 1 byte long). Thus from Eq. (2.64a), if we wish to
correct up to 10 errors, then t = 10 and (n - k) = 20. We therefore have 235
message symbols and 20 check symbols. The code rate (efficiency) of the code
is given by k/n, which is (235/255) = 0.92 or 92%.

2.7.3 Interleaving

Interleaving is a technique that can be used with RS and other block codes to
improve performance. Individual bits are shifted to spread them over several
code blocks. The effect is to spread out long bursts so that error correction
can occur even for code bursts that are longer than t bits. After the message
is received, the bits are deinterleaved.
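
The text does not specify a particular interleaving arrangement. As one common
possibility, the sketch below assumes a simple row/column block interleaver
acting on equal-length code blocks: the first element of every block is sent,
then the second element of every block, and so on. The function names are
hypothetical.

```python
def interleave(blocks):
    """Send element 0 of every block, then element 1 of every block, and so on."""
    length = len(blocks[0])   # all blocks are assumed to have equal length
    return [block[i] for i in range(length) for block in blocks]


def deinterleave(stream, depth):
    """Rebuild `depth` blocks from a stream produced by interleave()."""
    return [stream[j::depth] for j in range(depth)]


if __name__ == "__main__":
    blocks = [[0, 1, 2, 3], [10, 11, 12, 13], [20, 21, 22, 23]]
    stream = interleave(blocks)
    # A channel burst of 3 consecutive elements, marked here with -1 ...
    for position in (4, 5, 6):
        stream[position] = -1
    # ... lands as a single corrupted element in each of the 3 blocks.
    print(deinterleave(stream, depth=3))
```

With an interleaving depth equal to the number of blocks, a channel burst no
longer than the depth touches each block at most once.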


2.7.4 Improvement from the RS Code 

We can calculate the improvement from the RS code in a manner similar to
that which was used in the Hamming code. Now, the P_{ue} is the probability
of an undetected error in a code block and P_{se} is the probability of a symbol
error. Since the code can correct up to t errors, the block error probability is
that of having more than t symbol errors in a block, which can be written as

P_{ue} = 1 - \sum_{i=0}^{t} \binom{n}{i} (P_{se})^{i} (1 - P_{se})^{n-i}    (2.65)

If we didn't have an RS code, any error in a code block would be uncorrectable,
and the probability is given as

P_{ue} = 1 - (1 - P_{se})^{n}    (2.66)

One can plot a set of curves to illustrate the error-correcting performance
of the code. A graph of Eq. (2.65) appears in Fig. 2.12 for the example in our
discussion. Figure 2.12 is similar to Figs. 2.5 and 2.8 except that the x-axis is
plotted in opposite order and the y-axis has not been normalized by dividing
by Eq. (2.66). Reading from the curve, we see for the case where t = 5 and
P_{se} = 10^{-3}:

P_{ue} = 3 \times 10^{-7}    (2.67)
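
Equation (2.65) can also be evaluated numerically rather than read from a
curve. In the sketch below the function names are hypothetical, and the tail of
the binomial distribution (more than t symbol errors) is summed directly
instead of subtracting from 1, which is assumed here in order to avoid loss of
precision when the result is very small.

```python
from math import comb


def block_error_probability(n, t, p_se):
    """Probability of more than t symbol errors among n symbols, as in Eq. (2.65)."""
    return sum(
        comb(n, i) * p_se ** i * (1 - p_se) ** (n - i)
        for i in range(t + 1, n + 1)
    )


def uncoded_block_error_probability(n, p_se):
    """Probability of at least one symbol error in the block, Eq. (2.66)."""
    return 1 - (1 - p_se) ** n


if __name__ == "__main__":
    # The case read from the curve in the text: n = 255, t = 5, P_se = 1e-3.
    print(f"{block_error_probability(255, 5, 1e-3):.2e}")
    print(f"{uncoded_block_error_probability(255, 1e-3):.2e}")
```

For n = 255, t = 5, and P_se = 10^{-3}, the sum comes out close to the value of
about 3 x 10^{-7} quoted in Eq. (2.67).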

2.7.5 Effect of RS Coder-Decoder Failures 

We can use Eqs. (2.8) and (2.9) to evaluate the effect of coder-decoder failures.
However, instead of computing per byte of transmission, we compute per block
of transmission. Thus, by analogy with Eqs. (2.10) and (2.11), for our example
we have

P_{ue} = e^{-2\lambda_b t} \times 3 \times 10^{-7} + (1 - e^{-2\lambda_b t})    (2.68)

where

t = 8 \times 255/(3,600B)    (2.69)

Figure 2.12 Probability of an uncorrected error in a block of 255 1-byte symbols
with 235 message symbols, 20 check symbols, and an error-correcting capacity of up
to 10 errors versus the probability of symbol error [AHA Applications Note, used with
permission].

We can compute when P_{ue} is composed of equal values for code failures and
chip failures by equating the first and second terms of Eq. (2.68). Substituting
a typical value of B = 19,200, we find that this occurs when the chip failure
rate is equal to about 5.04 x 10^{-3} failures per hour. Using our model, the chip
failure rate = 0.004\sqrt{g} x 10^{-6}, which is equivalent to g = 1.6 x 10^{12}, a very
unlikely value. However, if we assume that P_{se} = 10^{-4}, then from Fig. 2.12
we see that P_{ue} = 3 x 10^{-13} and for B = 19,200 that the effects are equal if the
chip failure rate is equal to about 5.08 x 10^{-9}. Substitution into our chip failure
rate model shows that this occurs when g is approximately 2. Thus coder-decoder
failures predominate for the second case.
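
The break-even calculation can be sketched as follows. Setting the two terms of
Eq. (2.68) equal gives e^{-2\lambda_b t}(1 + P) = 1, so 2\lambda_b t = ln(1 + P),
where P is the code-only block error probability. The function names are
hypothetical, and the gate-count inversion simply applies the chip model quoted
in the text, \lambda_b = 0.004\sqrt{g} x 10^{-6} failures per hour.

```python
from math import log1p


def block_time_hours(bits_per_second, symbols=255, bits_per_symbol=8):
    """Transmission time of one block in hours, Eq. (2.69)."""
    return bits_per_symbol * symbols / (3600 * bits_per_second)


def break_even_failure_rate(p_ue_code, bits_per_second):
    """Chip failure rate (per hour) at which both terms of Eq. (2.68) are equal."""
    t = block_time_hours(bits_per_second)
    # exp(-2*lam*t) * P = 1 - exp(-2*lam*t)  =>  2*lam*t = ln(1 + P)
    return log1p(p_ue_code) / (2 * t)


def equivalent_gates(failure_rate_per_hour):
    """Invert the chip model lambda = 0.004 * sqrt(g) * 1e-6 failures per hour."""
    return (failure_rate_per_hour / 0.004e-6) ** 2


if __name__ == "__main__":
    for p_ue in (3e-7, 3e-13):
        lam = break_even_failure_rate(p_ue, 19_200)
        print(f"P_ue = {p_ue:g}: lambda = {lam:.3g}/hr, g = {equivalent_gates(lam):.3g}")
```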

Another approach to investigating the impact of chip failures is to use 
manufacturers’ data on RS coder-decoder failures. Some data exists [AHA 
Reliability Report, 1995] that is derived from accelerated tests. To collect enough 
failure data for low-failure-rate components, an accelerated life test is run, and 
the Arrhenius relationship is used to scale back the failure rates to normal 
operating temperatures (70-85°C). The resulting failure rates range from 
$50 \times 10^{-9}$ to $700 \times 10^{-9}$ failures per hour, which certainly exceed the 
just-calculated significant failure rate threshold of $5.08 \times 10^{-9}$, which was the 
value calculated for 19,200 baud and a symbol error probability of $10^{-4}$. (Note: 
using the gate model, we calculate $\lambda = 700 \times 10^{-9}$ as equivalent to about 
30,000 gates.) Clearly we conclude that the chip failures will predominate for 
some common ranges of the system parameters. 
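
The break-even calculation above can be sketched in a few lines of Python. This is only an illustration: it solves Eq. (2.68) for the failure rate at which the two terms are equal and inverts the gate model quoted in the text. The function names are hypothetical, and the exact solution differs slightly from the rounded values quoted above.

```python
import math


def block_time_hours(baud: float, symbols: int = 255, bits_per_symbol: int = 8) -> float:
    """Transmission time of one block in hours, as in Eq. (2.69)."""
    return bits_per_symbol * symbols / (3600.0 * baud)


def break_even_failure_rate(p_ue_code: float, t_hours: float) -> float:
    """Chip failure rate (per hour) at which both terms of Eq. (2.68) are equal.

    Solves exp(-2*lam*t) * p = 1 - exp(-2*lam*t) for lam.
    """
    return math.log1p(p_ue_code) / (2.0 * t_hours)


def equivalent_gates(failure_rate: float) -> float:
    """Invert the gate model, failure rate = 0.004 * sqrt(g) * 1e-6, for g."""
    return (failure_rate / 0.004e-6) ** 2


if __name__ == "__main__":
    t = block_time_hours(19_200)
    for p_ue in (3e-7, 3e-13):
        lam = break_even_failure_rate(p_ue, t)
        print(f"P_ue = {p_ue:.0e}: lambda = {lam:.3e} per hour, g = {equivalent_gates(lam):.3g}")
```

Because the block time is so short, the break-even failure rate scales almost linearly with the code's own error probability, which is why the second case lands in the range of measured chip failure rates.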


2.8 OTHER CODES 

There are many other types of error codes. We will briefly discuss the special 
features of the more popular codes and refer the reader to the references for 
additional details. 

1. Burst error codes. All the foregoing codes assume that errors occur 
infrequently and are independent, generally corrupting a single bit or a 
few bits in a word. In some cases, errors may occur in bursts. If a study 
of the failure modes of the device or medium we wish to protect by our 
coding indicates that errors occur in bursts to affect all the bits in a word, 
other coding techniques are required. The reader should consult the 
references for works that discuss Binary Block codes, m-out-of-n codes, 
Berger codes, Cyclic codes, and Reed-Solomon codes [Pradhan, 1986, 
1993]. 

2. BCH codes. This is a code that was independently discovered by Bose, 
Chaudhuri, and Hocquenghem. (Reed-Solomon codes are a subclass of 
BCH codes.) These codes can be viewed as extensions of Hamming 
codes, which are easier to design and implement for a large number of 
correctable errors. 

3. Concatenated codes. This refers to the process of lumping more than one 
code word together to reduce the overhead, which is generally less for long 
code words (cf. Table 2.8). Disadvantages include higher error probability 
(since check bits cover several words), more complexity and depth, and 
a delay for associated decoding trees. 

4. Convolutional codes. Sometimes, codes are “nested”; for example, 
information can be coded by an inner code, and the resulting alphabet of legal 
code words can be treated as a “symbol” subjected to an outer code. An 
example might be the use of a Hamming SECSED code as the inner code 
word and a Reed-Solomon code as an outer code scheme. 

5. Check sum. The sum of all the numbers in a block of words is added, 
modulo 2, and the block and the sum are transmitted. The words in the 
received block are added again and the check sum is recomputed and 
checked with the transmitted sum. This is an error-detecting code. 

6. Duplication. One can always transmit the result twice and check the two 
copies. Although this may be inefficient, it is the only technique in some 
cases: for example, if we wish to check logical operations, AND, OR, 
and NAND. 

7. Fire code. An interleaved code for burst errors. The similar 
Reed-Solomon code is now more popular since it is somewhat more efficient. 

8. Hamming codes. Other codes in the family use more error-correcting 
and -detecting bits, thereby achieving higher levels of fault tolerance. 

9. IC chip parity. Can be one bit per word, one bit per byte, or interlaced 
parity where b bits are watched over by i check bits. Thus each check 
bit “watches over” b/i bits. 

10. Interleaving. One approach to dealing with burst errors is to disassemble 
codes into a number of words, then reassemble them so that one bit is 
chosen from each word. For example, one could take 8 bytes and 
interleave (also called interlace) the bits so that a new byte is constructed 
from all the first bits of the original 8 bytes, another is constructed from 
all the second bits, and so forth. In this example, as long as the burst 
length is less than 8 bits and we have only one burst per 8 bytes, we are 
guaranteed that each new word can contain at most one error. 

11. Residue m codes. This is used for checking certain arithmetic operations, 
such as addition, multiplication, and shifting. One computes the code bits 
(residue, R) that are concatenated (|, i.e., appended) to the message N to 
form N|R. The residue is the remainder left when N is divided by m. After 
transmission or computation, the new message bits N' are divided by m to form 
R'. Disagreement of R and R' indicates an error. 

12. Viterbi decoding. A decoding algorithm for error correction of a 
Reed-Solomon or other convolutional code based on enumerating all the 
legal code words and choosing the one closest to the received words. For 
medium-sized search spaces, an organized search resembling a 
branching tree was devised by Viterbi in 1967; it is often used to shorten the 
search. Forney recognized in 1968 that such trees are repetitive, so he 
devised an improvement that led to a diagram looking like a “lattice” 
used for supporting plants and trees. 
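
Two of the entries above, interleaving and residue checking, are easy to try out. The Python sketch below is a minimal illustration under stated assumptions: the interleaver handles exactly 8 bytes, bits are numbered from the leftmost bit of each byte, and the residue check uses m = 3 on an addition. All function names are hypothetical.

```python
def interleave(block: bytes) -> bytes:
    """Transpose an 8 x 8 bit matrix: output byte j collects bit j of every input byte."""
    if len(block) != 8:
        raise ValueError("this sketch handles exactly 8 bytes")
    out = bytearray(8)
    for j in range(8):          # bit position inside each input byte (0 = leftmost)
        for i in range(8):      # index of the input byte
            bit = (block[i] >> (7 - j)) & 1
            out[j] |= bit << (7 - i)
    return bytes(out)


# Transposing twice restores the original block, so one routine serves both ends.
deinterleave = interleave


def flip_burst(stream: bytes, start: int, length: int) -> bytes:
    """Invert `length` consecutive bits of `stream`, beginning at bit index `start`."""
    damaged = bytearray(stream)
    for pos in range(start, start + length):
        damaged[pos // 8] ^= 0x80 >> (pos % 8)
    return bytes(damaged)


def residue_check_sum(a: int, b: int, computed_sum: int, m: int = 3) -> bool:
    """Residue-m check of an addition: the residue of the result must equal
    the residue of the sum of the operand residues."""
    return computed_sum % m == (a % m + b % m) % m


if __name__ == "__main__":
    original = bytes([0x12, 0x34, 0x56, 0x78, 0x9A, 0xBC, 0xDE, 0xF0])
    received = deinterleave(flip_burst(interleave(original), start=13, length=7))
    print([bin(x ^ y).count("1") for x, y in zip(original, received)])
    print(residue_check_sum(25, 17, 42), residue_check_sum(25, 17, 43))
```

After deinterleaving, a single burst is spread so that no original byte carries more than one bit error, which a single-error-correcting code on each byte could then repair. The residue check, by contrast, only detects errors, and it misses any error that changes the result by a multiple of m.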

PROBLEMS 

2.1. Find a recent edition of Jane’s All the World’s Aircraft in a technical or 
public library. Examine the data given in Table 2.1 for soft failures for 
the 6 aircraft types listed. From the book, determine the approximate 
number of electronic systems (aircraft avionics) for each of the aircraft 
that are computer-controlled (digital rather than analog). You may have 
to do some intelligent estimation to determine this number. One section 
in the book gives information on the avionics systems installed. Also, 
it may help to know that the U.S. companies (all mergers) that provide 
most of the avionics systems are Bendix/King/Allied, Sperry/Honeywell, 
and Collins/Rockwell. (Hint: You may have to visit the Web sites of 
the aircraft manufacturers or the avionics suppliers for more details.) 

(a) Plot the number of soft fails per aircraft versus the number of 
avionics systems on board. Comment on the results. 

(b) It would be better to plot soft fails per aircraft versus the number of 
words of main memory for the avionics systems on board. Do you 
have any ideas on how you could obtain such data? 

2.2. Compute the minimum code distance for all the code words given in 
Table 2.2. 

(a) Compute for column (a) and comment. 

(b) Compute for column (b) and comment. 

(c) Compute for column (c) and comment. 

2.3. Repeat the parity-bit coder and decoder designs given in Fig. 2.1 for an 
8-bit word with 7 message bits and 1 parity bit. Does this approach to 
design of a coder and decoder present any difficulties? 

2.4. Compare the design of problem 2.3 with that given in Fig. 2.2 on the 
basis of ease of design, complexity, practicality, delay time (assume all 
gates have a delay of D), and number of gates. 

2.5. Compare the results of problem 2.4 with the circuit of Fig. 2.4. 

2.6. Compute the binomial probabilities B(r: 8, q) for r = 1 to 8. 

(a) Does the sum of these probabilities check with Eq. (2.5)? 

(b) Show for what values of q the term B(1: 8, q) dominates all the 
error-occurrence probabilities. 

2.7. Find a copy of the latest military failure-rate manual (MIL-HDBK-217) 
and plot the data on Fig. B7 of Appendix B. Does it agree? Can you 
find any other IC failure-rate information? (Hint: The telecommunication 
industry and the various national telephone companies maintain large 
failure-rate databases. Also, the Annual Reliability and Maintainability 
Symposium from the IEEE regularly publishes papers with failure-rate 
data.) Does this data agree with the other results? What advances have 
been made in the last decade or so? 

2.8. Assume that a 10% reduction in the probability of undetected error from 
coder and decoder failures is acceptable. 

(a) Compute the value of B at which a 10% reduction occurs for fixed 
values of q. 

(b) Plot the results of part (a) and interpret. 

2.9. Check the results given in Table 2.5. How is the distance d related to 
the number of check bits? Explain. 

2.10. Check the values given in Tables 2.7 and 2.8. 

2.11. The Hamming SECSED code with 4 message bits and 3 check bits is 
used in the text as an example (Section 2.4.3). It was stated that we could 
use a brute force technique of checking all the legal or illegal code words 
for error detection, as was done for the parity-bit code in Fig. 2.1. 

(a) List all the legal and illegal code words for this example and show 
that the code distance is 3. 

(b) Design an error-detector circuit using minimized two-level logic (cf. 
Fig. 2.1). 

2.12. Design a check bit generating circuit for problem 2.11 using Eqs. 
(2.22a-c) and EXOR gates. 

2.13. One technique for error correction is to pick the nearest code word as 
the correct word once an error has been detected. 

(a) Devise a software algorithm that can be used to program a 
microprocessor to perform such error correction. 

(b) Devise a hardware design that performs the error correction by 
choosing the closest word. 

(c) Compare complexity and speed of designs (a) and (b). 

2.14. An error-correcting circuit for a Hamming (7, 4) SECSED is given in 
Fig. 2.7. How would you generate the check bits that are defined in Eqs. 
(2.22a-c)? Is there a better way than that suggested in problem 2.12? 

2.15. Compare the designs of problems 2.11, 2.12, and 2.13 with Hamming’s 
technique in problem 2.14. 

2.16. Give a complete design for the code generator and checker for a 
Hamming (12, 8) SECSED code following the approach of Fig. 2.7. 

2.17. Repeat problem 2.16 for a SECDED code. 

2.18. Repeat problem 2.8 for the design referred to in Table 2.14. 

2.19. Retransmission as described in Section 2.5 tends to decrease the 
effective baud rate (B) of the transmission. Compare the unreliability and the 
effective baud rate for the following designs: 

(a) Transmit each word twice and retransmit when they disagree. 

(b) Transmit each word three times and use the majority of the three 
values to determine the output. 

(c) Use a parity-bit code and only retransmit when the code detects an 
error. 

(d) Use a Hamming SECSED code and only retransmit when the code 
detects an error. 

2.20. Add the probabilities of generator and checker failure for the reliability 
examples given in Section 2.5.3. 

2.21. Assume we are dealing with a burst code design for error detection 
with a word length of 12 bits and a maximum burst length of 4, as 
noted in Eqs. (2.46)-(2.50). Assume the code vector 
$V(x_1, x_2, \ldots, x_{12}) = (C_1 C_2 C_3 C_4 10100011)$. 

(a) Compute $C_1 C_2 C_3 C_4$. 

(b) Assume no errors and show how the syndrome works. 

(c) Assume one error in bit $C_2$ and show how the syndrome works. 

(d) Assume one error in bit $x_9$; then show how the syndrome works. 

(e) Assume two errors in bits $x_8$ and $x_9$; then show how the syndrome 
works. 

(f) Assume three errors in bits $x_8$, $x_9$, and $x_{10}$; then show how the 
syndrome works. 

(g) Assume four errors in bits $x_7$, $x_8$, $x_9$, and $x_{10}$; then show how the 
syndrome works. 

(h) Assume five errors in bits $x_7$, $x_8$, $x_9$, $x_{10}$, and $x_{11}$; then show how 
the syndrome fails. 

(i) Repeat the preceding computations using a different set of four 
equations to calculate the check bits. 

2.22. Draw a circuit for generating the check bits, the syndrome vector, and 
the error-detection output for the burst error-detecting code example of 
Section 2.6.2. 

(a) Use parallel computation and use EXOR gates. 

(b) Use serial computation and a linear feedback shift register. 

2.23. Compute the probability of undetected error for the code of problem 
2.22 and compare with the probability of undetected error for the case 
of no error detection. Assume perfect hardware. 

2.24. Repeat problem 2.23 assuming that the hardware is imperfect. 

(a) Assume a model as in Sections 2.3.4 and 2.4.5. 

(b) Plot the results as in Figs. 2.5 and 2.8. 

2.25. Repeat problem 2.22 for the burst error-detecting code in Section 2.6.3. 

2.26. Repeat problem 2.23 for the burst error-detecting code in Section 2.6.3. 

2.27. Repeat problem 2.24 for the burst error-detecting code in Section 2.6.3. 

2.28. Analyze the design of Fig. 2.4 and show that it is equivalent to Fig. 2.2. 
Also, explain how it can be used as a generator and checker. 

2.29. Explain in detail the operation of the error-correcting circuit given in 
Fig. 2.7. 

2.30. Design a check bit generator circuit for the SECDED code example in 
Section 2.4.4. 

2.31. Design an error-correcting circuit for the SECDED code example in 
Section 2.4.4. 

2.32. Explain how a distance 3 code can be implemented as a double 
error-detecting code (DED). Give the circuit for the generator and checker. 

2.33. Explain how a distance 4 code can be implemented as a triple 
error-detecting code (TED). Give the circuit for the generator and checker. 

2.34. Construct a table showing the relationship between the burst length t, 
the auxiliary check bits u, the total number of check bits, the number 
of message bits, and the length of the code word. Use a tabular format 
similar to Table 2.7. 

2.35. Show for the u = 5 and t = 3 code example given in Section 2.6.3 that 
after x shifts, the leftmost bits of R2 and R3 in Fig. 2.10 agree. 

2.36. Show a complete circuit for error correction that includes Fig. 2.10 in 
addition to a counter, a decoder, a bit-complementing circuit, and a 
corrected word storage register, as well as control logic. 

2.37. Show a complete circuit for error correction that includes Fig. 2.10 in 
addition to a counter, an EXOR-complementing circuit, and a corrected 
word storage register, as well as control logic. 

2.38. Show a complete circuit for error correction that includes Fig. 2.10 in 
addition to a counter, an EXOR-complementing circuit, and a circulating 
register for R1 to contain the corrected word, as well as control logic. 

2.39. Explain how the circuit of Fig. 2.11 acts as a coder. Input the message 
bits; then show what is generated and which bits correspond to the 
auxiliary syndrome and which ones correspond to the real syndrome. 

2.40. What additional circuitry is needed (if any) to supplement Fig. 2.11 to 
produce a coder? Explain. 

2.41. Using Fig. 2.12 for the Reed-Solomon code, plot a graph similar to 
Fig. 2.8. 


REDUNDANCY, SPARES, AND REPAIRS 


3.1 INTRODUCTION 

This chapter deals with a variety of techniques for improving system reliability 
and availability. Underlying all these techniques is the basic concept of 
redundancy, providing alternate paths to allow the system to continue operation even 
when some components fail. Alternate paths can be provided by parallel 
components (or systems). The parallel elements can all be continuously operated, 
in which case all elements are powered up and the term parallel redundancy 
or hot standby is often used. It is also possible to provide one element that is 
powered up (on-line) along with additional elements that are powered down 
(standby), which are powered up and switched into use, either automatically 
or manually, when the on-line element fails. This technique is called standby 
redundancy or cold redundancy. These techniques have all been known for 
many years; however, with the advent of modern computer-controlled digital 
systems, a rich variety of ways to implement these approaches is available. 
Sometimes, system engineers use the general term redundancy management 
to refer to this body of techniques. In a way, the ultimate cold redundancy 
technique is the use of spares or repairs to renew the system. At this level 
of thinking, a spare and a repair are the same thing—except the repair takes 
longer to be effected. In either case for a system with a single element, we 
must be able to tolerate some system downtime to effect the replacement or 
repair. The situation is somewhat different if we have a system with two hot 
or cold standby elements combined with spares or repairs. In such a case, once 
one of the redundant elements fails and we detect the failure, we can replace 
or repair the failed element while the system continues to operate; as long as the 
replacement or repair takes place before the operating element fails, the system 
never goes down. The only way the system goes down is for the remaining 
element(s) to fail before the replacement or repair is completed. 

This chapter deals with conventional techniques of improving system or 
component reliability, such as the following: 


1. Improving the manufacturing or design process to significantly lower 
the system or component failure rate. Sometimes innovative 
engineering does not increase cost, but in general, improved reliability requires 
higher cost or increases in weight or volume. In most cases, however, the 
gains in reliability and decreases in life-cycle costs justify the 
expenditures. 

2. Parallel redundancy, where one or more extra components are operating 
and waiting to take over in case of a failure of the primary system. In 
the case of two computers and, say, two disk memories, synchronization 
of the primary and the extra systems may be a bit complex. 

3. A standby system is like parallel redundancy; however, power is off in 
the extra system so that it cannot fail while in standby. Sometimes the 
sensing of primary system failure and switching over to the standby 
system is complex. 

4. Often the use of replacement components or repairs in conjunction with 
parallel or standby systems increases reliability by another substantial 
factor. Essentially, once the primary system fails, it is a race to fix or 
replace it before the extra system(s) fails. Since the repair rate is 
generally much higher than the failure rate, the repair almost always wins the 
race, and reliability is greatly increased. 
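
The race described in item 4 can be given a rough number under an assumption that is not stated above: if the time to the next failure and the repair time are independent and exponentially distributed with constant rates, the probability that the repair finishes first is the repair rate divided by the sum of the two rates. The Python sketch below checks that expression against a small simulation; the rates and function names are hypothetical.

```python
import random


def p_repair_wins(failure_rate: float, repair_rate: float) -> float:
    """Probability that an exponential repair completes before an exponential failure."""
    return repair_rate / (failure_rate + repair_rate)


def simulate_race(failure_rate: float, repair_rate: float,
                  trials: int = 100_000, seed: int = 1) -> float:
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    wins = sum(
        rng.expovariate(repair_rate) < rng.expovariate(failure_rate)
        for _ in range(trials)
    )
    return wins / trials


if __name__ == "__main__":
    # Hypothetical rates: one failure per 10,000 hours, repairs averaging 10 hours.
    print(f"analytic:  {p_repair_wins(1e-4, 0.1):.6f}")
    print(f"simulated: {simulate_race(1e-4, 0.1):.6f}")
```

With a repair rate a thousand times the failure rate, the repair loses the race only about once in a thousand failures, which is the sense in which reliability is "greatly increased."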


Because fault-tolerant systems generally have very low failure rates, it is 
hard and expensive to obtain failure data from tests. Thus second-order factors, 
such as common mode and dependent failures, may become more important 
than they usually are. 

The reader will need to use the concepts of probability in Appendix A, 
Sections A1-A6.3 and those of reliability in Appendix B3 for this chapter. 
Markov modeling will appear later in the chapter; thus the principles of the 
Markov model given in Appendices A8 and B6 will be used. The reader who 
is unfamiliar with this material or needs review should consult these sections. 

If we are dealing with large complex systems, as is often the case, it is 
expedient to divide the overall problem into a number of smaller subproblems 
(the “divide and conquer” strategy). An approximate and very useful approach 
to such a strategy is the method of apportionment discussed in the next section. 

[Figure 3.1 block labels: $r_1, r_2, \ldots, r_k$] 

Figure 3.1 A system model composed of k major subsystems, all of which are 
necessary for system success. 


3.2 APPORTIONMENT 

One might conceive system design as an optimization problem in which one 
has a budget of resources (dollars, pounds, cubic feet, watts, etc.), and the goal 
is to achieve the highest reliability within the constraints of the available 
budget. Such an approach is discussed in Chapter 7; however, we need to use some 
of the simple approaches to optimization as a structure for comparison of the 
various methods discussed in this chapter. Also, in a truly large system, there 
are too many possible combinations of approach; a top-down design 
philosophy is therefore useful to decompose the problem into simpler subproblems. 
The technique of apportionment serves well as a “divide and conquer” strategy 
to break down a large problem. 

Apportionment techniques generally assume that the highest level (the 
overall system) can be divided into 5-10 major subsystems, all of which must work 
for the system to work. Thus we have a series structure as shown in Fig. 3.1. 

We denote $x_1$ as the event success of element (subsystem) 1, $\bar{x}_1$ as the event 
failure of element 1, and $P(x_1) = 1 - P(\bar{x}_1)$ as the probability of success (the 
reliability, $r_1$). The system reliability is given by 

$$R_s = P(x_1 \cap x_2 \cap \cdots \cap x_k) \tag{3.1a}$$

and if we use the more common engineering notation, this equation becomes 

$$R_s = P(x_1 x_2 \cdots x_k) \tag{3.1b}$$

If we assume that all the elements are independent, Eq. (3.1a) becomes 

$$R_s = \prod_{i=1}^{k} r_i \tag{3.2}$$

To illustrate the approach, let us assume that the goal is to achieve a system 
reliability equal to or greater than the system goal, $R_0$, within the cost budget, 
$c_0$. We let the single constraint be cost, and the total cost, $c$, is given by the 
sum of the individual component costs, $c_i$: 

$$c = \sum_{i=1}^{k} c_i \tag{3.3}$$

We assume that the system reliability given by Eq. (3.2) is below the 
system specification or goal, and that the designer must improve the reliability 
of the system. We further assume that the maximum allowable system cost, 
$c_0$, is generally sufficiently greater than $c$ so that the system reliability can be 
improved to meet its reliability goal, $R_s \ge R_0$; otherwise, the goal cannot be 
reached, and the best solution is the one with the highest reliability within the 
allowable cost constraint. 

Assume that we have a method for obtaining optimal solutions and, in 
the case where more than one solution exceeds the reliability goal within the 
cost constraint, that it is useful to display a number of “good” solutions. The 
designer may choose to just meet the reliability goal with one of the 
suboptimal solutions and save some money. Alternatively, there may be secondary 
factors that favor a good suboptimal solution. Lastly, a single optimum value 
does not give much insight into how the solution changes if some of the cost 
or reliability values assumed as parameters are somewhat in error. A family of 
solutions and some sensitivity studies may reveal a good suboptimal solution 
that is less sensitive to parameter changes than the true optimum. 

A simple approach to solving this problem is to assume that an equal 
apportionment of all the elements, $r_i = r_1$, to achieve $R_0$ will be a good starting place. 
Thus Eq. (3.2) becomes 

$$R_0 = \prod_{i=1}^{k} r_i = (r_1)^k \tag{3.4}$$

and solving for $r_1$ yields 

$$r_1 = (R_0)^{1/k} \tag{3.5}$$

Thus we have a simple approximate solution for the problem of how to 
apportion the subsystem reliability goals based on the overall system goal. 
More details of such optimization techniques appear in Chapter 7. 
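
Equation (3.5) is simple enough to evaluate directly. The short Python sketch below does so for a hypothetical system goal of 0.95; the goal value and the function name are illustrative assumptions, not figures from the text.

```python
def equal_apportionment(system_goal: float, k: int) -> float:
    """Subsystem reliability r_1 = R_0 ** (1/k), as in Eq. (3.5)."""
    if not 0.0 < system_goal <= 1.0 or k < 1:
        raise ValueError("need 0 < system_goal <= 1 and k >= 1")
    return system_goal ** (1.0 / k)


if __name__ == "__main__":
    for k in (5, 10):
        r = equal_apportionment(0.95, k)
        print(f"k = {k:2d}: each subsystem needs r = {r:.5f}")
```

The result shows how quickly the per-subsystem requirement tightens as the number of series subsystems grows, which is why equal apportionment is treated only as a starting place.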


3.3 SYSTEM VERSUS COMPONENT REDUNDANCY 

There are many ways to implement redundancy. In Shooman [1990, 
Section 6.6.1], three different designs for a redundant auto-braking system are 
compared: a split system, which presently is used on American autos either 
front/rear or LR-RF/RR-LF diagonals; two complete systems; or redundant 
components (e.g., parallel lines). Other applications suggest different 
possibilities. Two redundancy techniques that are easily classified and studied are 
component and system redundancy. In fact, one can prove that component 
redundancy is superior to system redundancy in a wide variety of situations. 

Figure 3.2 Comparison of three different systems: (a) single system, (b) unit 
redundancy, and (c) component redundancy. 

Consider the three systems shown in Fig. 3.2. The reliability expression for 
system (a) is 

$$R_a(p) = P(x_1)P(x_2) = p^2 \tag{3.6}$$

where both $x_1$ and $x_2$ are independent and identical and $P(x_1) = P(x_2) = p$. The 
reliability expression for system (b) is given simply by 

$$R_b(p) = P(x_1 x_2 + x_3 x_4) \tag{3.7a}$$

For independent identical units (IIU) with reliability of $p$, 

$$R_b(p) = 2R_a - R_a^2 = p^2(2 - p^2) \tag{3.7b}$$

In the case of system (c), one can combine each component pair in parallel 
to obtain 

$$R_c(p) = P(x_1 + x_3)P(x_2 + x_4) \tag{3.8a}$$

Assuming IIU, we obtain 

$$R_c(p) = p^2(2 - p)^2 \tag{3.8b}$$

To compare Eqs. (3.8b) and (3.7b), we use the ratio 

$$\frac{R_c(p)}{R_b(p)} = \frac{p^2(2 - p)^2}{p^2(2 - p^2)} = \frac{(2 - p)^2}{2 - p^2} \tag{3.9}$$

Algebraic manipulation yields 

$$\frac{R_c(p)}{R_b(p)} = \frac{(2 - p)^2}{2 - p^2} = \frac{4 - 4p + p^2}{2 - p^2} = \frac{(2 - p^2) + 2(1 - p)^2}{2 - p^2} = 1 + \frac{2(1 - p)^2}{2 - p^2} \tag{3.10}$$

Because $0 < p < 1$, the term $2 - p^2 > 0$, and $R_c(p)/R_b(p) > 1$; thus 
component redundancy is superior to system redundancy for this structure. (Of course, 
they are equal at the extremes when $p = 0$ or $p = 1$.) 

We can extend these chain structures into an n-element series structure, two 
parallel n-element system-redundant structures, and a series of n structures of 
two parallel elements. In this case, Eq. (3.9) becomes 

$$\frac{R_c(p)}{R_b(p)} = \frac{(2 - p)^n}{2 - p^n} \tag{3.11}$$


Roberts [1964, p. 260] proves by induction that this ratio is always greater 
than 1 and that component redundancy is superior regardless of the number of 
elements n. 
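
The ratio in Eq. (3.11) can be spot-checked numerically. The Python sketch below is an illustration only: it assumes independent identical elements of reliability p, and the function names are hypothetical.

```python
def system_redundant(p: float, n: int) -> float:
    """Two n-element series chains in parallel: 2p^n - p^(2n)."""
    return 2 * p**n - p ** (2 * n)


def component_redundant(p: float, n: int) -> float:
    """A series of n parallel pairs: (2p - p^2)^n."""
    return (2 * p - p**2) ** n


def ratio_formula(p: float, n: int) -> float:
    """Right-hand side of Eq. (3.11)."""
    return (2 - p) ** n / (2 - p**n)


if __name__ == "__main__":
    for n in (2, 3, 10):
        for p in (0.5, 0.9, 0.99):
            ratio = component_redundant(p, n) / system_redundant(p, n)
            print(f"n = {n:2d}, p = {p:4.2f}: ratio = {ratio:.6f}")
```

For every combination tried the ratio exceeds 1, and the advantage of component redundancy grows with the number of elements and shrinks as p approaches 1.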

The superiority of component redundancy over system redundancy also 
holds true for nonidentical elements; an algebraic proof is given in Shooman 
[1990, p. 282]. 

A simpler proof of the foregoing principle can be formulated by 
considering the system tie-sets. Clearly, in Fig. 3.2(b), the tie-sets are $x_1x_2$ and $x_3x_4$, 
whereas in Fig. 3.2(c), the tie-sets are $x_1x_2$, $x_3x_4$, $x_1x_4$, and $x_3x_2$. Since the 
system reliability is the probability of the union of the tie-sets, and since system (c) 
has the same two tie-sets as system (b) as well as two additional ones, the 
component redundancy configuration has a larger reliability than the unit 
redundancy configuration. It is easy to see that this tie-set proof can be extended to 
the general case. 
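
The tie-set argument can be illustrated by brute force: enumerate every up/down state of the four elements and add the probabilities of the states in which at least one tie-set is intact. The Python sketch below assumes independent elements with a common reliability p; the element indices 0-3 stand for $x_1$-$x_4$, and the names are hypothetical.

```python
from itertools import product


def reliability_from_tie_sets(tie_sets, p: float, n_elements: int) -> float:
    """Probability that at least one tie-set has all of its elements working."""
    total = 0.0
    for state in product((True, False), repeat=n_elements):
        if any(all(state[i] for i in tie_set) for tie_set in tie_sets):
            prob = 1.0
            for up in state:
                prob *= p if up else 1.0 - p
            total += prob
    return total


UNIT_TIE_SETS = [(0, 1), (2, 3)]                       # x1x2, x3x4
COMPONENT_TIE_SETS = [(0, 1), (2, 3), (0, 3), (2, 1)]  # x1x2, x3x4, x1x4, x3x2

if __name__ == "__main__":
    p = 0.9
    print(f"unit redundancy:      {reliability_from_tie_sets(UNIT_TIE_SETS, p, 4):.4f}")
    print(f"component redundancy: {reliability_from_tie_sets(COMPONENT_TIE_SETS, p, 4):.4f}")
```

Because the second list contains the first, every state counted for unit redundancy is also counted for component redundancy, so the second probability can never be smaller.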

The specific result can be broadened to include a large number of structures. 
As an example, consider the system of Fig. 3.3(a) that can be viewed as a 
simple series structure if the parallel combination of $x_1$ and $x_2$ is replaced by 
an equivalent branch that we will call $x_5$. Then $x_5$, $x_3$, and $x_4$ form a simple 
chain structure, and component redundancy, as shown in Fig. 3.3(b), is clearly 
superior. Many complex configurations can be examined in a similar manner. 
Unit and component redundancy are compared graphically in Fig. 3.4. 

Figure 3.4 Redundancy comparison: (a) component redundancy and (b) unit 
redundancy. [Adapted from Figs. 7.10 and 7.11, Reliability Engineering, ARINC Research 
Corporation, used with permission, Prentice-Hall, Englewood Cliffs, NJ, 1964.] 

Figure 3.5 Comparison of component and unit redundancy for r-out-of-n systems: 
(a) a 2-out-of-4 system and (b) a 3-out-of-4 system. [Horizontal axis: component 
probability (R).] 

Another interesting case in which one can compare component and unit 
redundancy is in an r-out-of-n system (the system succeeds if r-out-of-n 
components succeed). Immediately, one can see that for $r = n$, the structure is a 
series system, and the previous result applies. If $r = 1$, the structure reduces 
to n parallel elements, and component and unit redundancy are identical. The 
interesting cases are then $2 \le r < n$. The results for 2-out-of-4 and 3-out-of-4 
systems are plotted in Fig. 3.5. Again, component redundancy is superior. 
The superiority of component over unit redundancy in an r-out-of-n system is 
easily proven by considering the system tie-sets. 
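
The r-out-of-n comparison can be reproduced numerically. The Python sketch below makes two modeling assumptions that the text does not spell out: unit redundancy means two complete r-out-of-n systems in parallel, and component redundancy means each of the n components is replaced by a parallel pair. Elements are taken to be independent and identical, and the function names are hypothetical.

```python
from math import comb


def r_out_of_n(p: float, r: int, n: int) -> float:
    """Probability that at least r of n independent components succeed."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(r, n + 1))


def unit_redundant(p: float, r: int, n: int) -> float:
    """Two complete r-out-of-n systems in parallel."""
    return 1 - (1 - r_out_of_n(p, r, n)) ** 2


def component_redundant(p: float, r: int, n: int) -> float:
    """An r-out-of-n system in which every component is a parallel pair."""
    return r_out_of_n(1 - (1 - p) ** 2, r, n)


if __name__ == "__main__":
    for r in (1, 2, 3, 4):
        u = unit_redundant(0.9, r, 4)
        c = component_redundant(0.9, r, 4)
        print(f"{r}-out-of-4: unit = {u:.8f}, component = {c:.8f}")
```

Under these assumptions the two schemes coincide for r = 1, and component redundancy comes out ahead for the larger values of r, consistent with the curves described for Fig. 3.5.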

All the above analysis applies to two-state systems. Different results are 
obtained for multistate models; see Shooman [1990, p. 286]. 

Figure 3.6 Comparison of system and component redundancy, including coupling: 
(a) system redundancy (one coupler) and (b) component redundancy (three couplers). 


In a practical case, implementing redundancy is a bit more complex than 
indicated in the reliability graphs used in the preceding analyses. A simple 
example illustrates the issues involved. We all know that public address 
systems consisting of microphones, connectors and cables, amplifiers, and 
speakers are notoriously unreliable. Using our principle that component redundancy 
is better, we should have two microphones that are connected to a switching 
box, and we should have two connecting cables from the switching box to dual 
inputs to amplifier 1 or 2 that can be selected from a front panel switch, and we 
select one of two speakers, each with dual wires from each of the amplifiers. 
We now have added the reliability of the switches in series with the parallel 
components, which lowers the reliability a bit; however, the net result should 
be a gain. Suppose we carry component redundancy to the extreme by trying 
to parallel the resistors, capacitors, and transistors in the amplifier. In most 
cases, it is far from simple to merely parallel the components. Thus how low 
a level of redundancy is feasible is a decision that must be left to the system 
designer. 

We can study the required circuitry needed to allow redundancy; we will 
call such circuitry or components couplers. Assume, for example, that we have 
a system composed of three components and wish to include the effects of 
coupling in studying system versus component reliability by using the model 
shown in Fig. 3.6. (Note that the prime notation is used to represent a 
“companion” element, not a logical complement.) For the model in Fig. 3.6(a), the 
reliability expression becomes 

$$R_a = P(x_1 x_2 x_3 + x_1' x_2' x_3')P(x_c) \tag{3.12}$$

and if we have IIU and $P(x_c) = Kp$, 

$$R_a = (2p^3 - p^6)Kp \tag{3.13}$$

Similarly, for Fig. 3.6(b) we have 

$$R_b = P(x_1 + x_1')P(x_2 + x_2')P(x_3 + x_3')P(x_{c1})P(x_{c2})P(x_{c3}) \tag{3.14}$$

and if we have IIU and $P(x_{c1}) = P(x_{c2}) = P(x_{c3}) = Kp$, 

$$R_b = (2p - p^2)^3 K^3 p^3 \tag{3.15}$$

We now wish to explore for what value of $K$ Eqs. (3.13) and (3.15) are 
equal: 

$$(2p^3 - p^6)Kp = (2p - p^2)^3 K^3 p^3 \tag{3.16a}$$

Solving for $K$ yields 

$$K^2 = \frac{2p^3 - p^6}{(2p - p^2)^3 p^2} \tag{3.16b}$$

If $p = 0.9$, substitution in Eq. (3.16b) yields $K = 1.085778501$, and the 
coupling reliability $Kp$ becomes 0.9772006509. The easiest way to interpret this 
result is to say that if the component failure probability $1 - p$ is 0.1, then 
component and system reliability are equal if the coupler failure probability is 
0.0228. In other words, if the coupler failure probability is less than 22.8% of 
the component failure probability, component redundancy is superior. Clearly, 
coupler reliability will probably be significant in practical situations. 
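
The break-even coupler reliability is easy to tabulate for other values of p. The Python sketch below evaluates Eq. (3.16b) and confirms that Eqs. (3.13) and (3.15) agree at the resulting K; the function names are illustrative only.

```python
from math import sqrt


def break_even_k(p: float) -> float:
    """K at which system and component redundancy are equal, from Eq. (3.16b)."""
    return sqrt((2 * p**3 - p**6) / ((2 * p - p**2) ** 3 * p**2))


def system_with_coupler(p: float, k: float) -> float:
    """Eq. (3.13): two three-element chains in parallel, one coupler."""
    return (2 * p**3 - p**6) * k * p


def component_with_couplers(p: float, k: float) -> float:
    """Eq. (3.15): three parallel pairs in series, three couplers."""
    return (2 * p - p**2) ** 3 * k**3 * p**3


if __name__ == "__main__":
    for p in (0.8, 0.9, 0.95, 0.99):
        k = break_even_k(p)
        share = (1 - k * p) / (1 - p)
        print(f"p = {p:4.2f}: K = {k:.6f}, coupler reliability = {k * p:.6f}, "
              f"coupler/component failure ratio = {share:.3f}")
```

For p = 0.9 this reproduces the values quoted above; the last column gives the coupler failure probability as a fraction of the component failure probability at break-even.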

Most reliability models deal with two element states, good and bad; 
however, in some cases, there are more distinct states. The classical case is a diode, 
which has three states: good, failed-open, and failed-shorted. There are also 
analogous elements, such as leaking and blocked hydraulic lines. (One could 
contemplate even more than three states; for example, in the case of a diode, 
the two “hard”-failure states could be augmented by an “intermittent” 
short-failure state.) For a treatment of redundancy for such three-state elements, see 
Shooman [1990, p. 286]. 


3.4 APPROXIMATE RELIABILITY FUNCTIONS 

Most system reliability expressions simplify to sums and differences of 
various exponential functions once the expressions for the hazard functions are 
substituted. Such functions may be hard to interpret; often a simple computer 
program and a graph are needed for interpretation. Notwithstanding the ease of 
computer computations, it is still often advantageous to have techniques that 
yield approximate analytical expressions. 

3.4.1 Exponential Expansions 

A general and very useful approximation technique commonly used in many
branches of engineering is the truncated series expansion. In reliability work,
terms of the form $e^{-Z}$ occur time and again; the expressions can be
simplified by series expansion of the exponential function. The Maclaurin
series expansion of $e^{-Z}$ about Z = 0 can be written as follows:

$$e^{-Z} = 1 - Z + \frac{Z^2}{2!} - \frac{Z^3}{3!} + \cdots + \frac{(-Z)^n}{n!} + \cdots \tag{3.17}$$


We can also write the series in n terms and a remainder term [Thomas, 1965,
p. 791], which accounts for all the terms after $(-Z)^n/n!$

$$e^{-Z} = 1 - Z + \frac{Z^2}{2!} - \frac{Z^3}{3!} + \cdots + \frac{(-Z)^n}{n!} + R_n(Z) \tag{3.18}$$

where

$$R_n(Z) = (-1)^{n+1} \int_0^Z \frac{(Z - \xi)^n}{n!}\, e^{-\xi}\, d\xi \tag{3.19}$$

We can therefore approximate $e^{-Z}$ by n terms of the series and use $R_n(Z)$
to approximate the remainder. In general, we use only two or three terms of
the series, since in the high-reliability region $e^{-Z} \approx 1$, Z is small,
and the high-order terms $Z^n$ in the series expansion become insignificant.
For example, the reliability of two parallel elements is given by

$$\begin{aligned}
(2e^{-Z}) + (-e^{-2Z}) &= \left(2 - 2Z + \frac{2Z^2}{2!} - \frac{2Z^3}{3!} + \cdots + \frac{2(-Z)^n}{n!} + \cdots\right) \\
&\quad + \left(-1 + 2Z - \frac{(2Z)^2}{2!} + \frac{(2Z)^3}{3!} - \cdots - \frac{(-2Z)^n}{n!} - \cdots\right) \\
&= 1 - Z^2 + Z^3 - \frac{7}{12} Z^4 + \frac{1}{4} Z^5 - \cdots
\end{aligned} \tag{3.20}$$


Two- and three-term approximations to Eqs. (3.17) and (3.20) are compared
with the complete expressions in Fig. 3.7(a) and (b). Note that the two-term
approximation is a “pessimistic” one, whereas the three-term expression is
slightly “optimistic”; inclusion of additional terms will give a sequence of
alternate upper and lower bounds. In Shooman [1990, p. 217], it is shown that
the magnitude of the nth term is an upper bound on the error term, $R_n(Z)$, in
an n-term approximation.
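The bounding behavior of the truncated series can be checked with a short script. This is a sketch under the assumption that the coefficient of $Z^k$ in the expansion of $2e^{-Z} - e^{-2Z}$ is $(2(-1)^k - (-2)^k)/k!$, which follows from expanding each exponential separately; the helper names `coeff` and `parallel_series` are hypothetical.

```python
import math


def coeff(k: int) -> float:
    """Coefficient of Z**k in the expansion of 2*exp(-Z) - exp(-2Z)."""
    return (2 * (-1) ** k - (-2) ** k) / math.factorial(k)


def parallel_series(z: float, max_power: int) -> float:
    """Truncated series for two parallel units, keeping powers up to max_power."""
    return sum(coeff(k) * z**k for k in range(max_power + 1))


z = 0.1
exact = 2 * math.exp(-z) - math.exp(-2 * z)
two_term = parallel_series(z, 2)     # 1 - Z^2
three_term = parallel_series(z, 3)   # 1 - Z^2 + Z^3

print(f"two-term   = {two_term:.6f}  (pessimistic)")
print(f"exact      = {exact:.6f}")
print(f"three-term = {three_term:.6f}  (optimistic)")
```

At Z = 0.1 the exact value lies between the two- and three-term approximations, which is the alternating-bound behavior described in the text.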

If the system being modeled involves repair, generally a Markov model is
used, and oftentimes Laplace transforms are used to solve the Markov equations.
In Section B8.3, a simplified technique for finding the series expansion
of a reliability function (cf. Eq. (3.20)) directly from a Laplace transform is
discussed.

Figure 3.7 Comparison of exact and approximate reliability functions: (a) single unit
and (b) two parallel units.

3.4.2 System Hazard Function 

Sometimes it is useful to compute and study the system hazard function (failure
rate). For example, suppose that a system consists of two series elements,
$x_2 x_3$, in parallel with a third, $x_1$. Thus, the system has two “success
paths”: it succeeds if $x_1$ works or if $x_2$ and $x_3$ both work. If all
elements have identical constant hazards, $\lambda$, the reliability function is
given by

$$R(t) = P(x_1 + x_2 x_3) = e^{-\lambda t} + e^{-2\lambda t} - e^{-3\lambda t} \tag{3.21}$$

From Appendix B, we see that z(t) is given by the density function divided
by the reliability function, which can be written as the negative of the time
derivative of the reliability function divided by the reliability function.

$$z(t) = \frac{f(t)}{R(t)} = -\frac{\dot{R}(t)}{R(t)} = \frac{\lambda\left(1 + 2e^{-\lambda t} - 3e^{-2\lambda t}\right)}{1 + e^{-\lambda t} - e^{-2\lambda t}} \tag{3.22}$$


Expanding z(t) in a Taylor series about t = 0,

$$z(t) = \lambda\left[4\lambda t - 9(\lambda t)^2 + \cdots\right] \tag{3.23}$$

We can use such approximations to compare the equivalent hazard of various 
systems. 
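The closed-form hazard of Eq. (3.22) can be cross-checked against a numerical derivative of Eq. (3.21). The sketch below assumes normalized time ($\lambda = 1$) and a central-difference step chosen for illustration; the function names are hypothetical.

```python
import math


def reliability(t: float, lam: float) -> float:
    """Eq. (3.21): x1 in parallel with the series pair x2, x3."""
    return math.exp(-lam * t) + math.exp(-2 * lam * t) - math.exp(-3 * lam * t)


def hazard_closed_form(t: float, lam: float) -> float:
    """Eq. (3.22): hazard obtained as -R'(t) / R(t)."""
    a = math.exp(-lam * t)
    return lam * (1 + 2 * a - 3 * a * a) / (1 + a - a * a)


def hazard_numeric(t: float, lam: float, h: float = 1e-5) -> float:
    """Central-difference estimate of -R'(t) / R(t)."""
    slope = (reliability(t + h, lam) - reliability(t - h, lam)) / (2 * h)
    return -slope / reliability(t, lam)


lam = 1.0
for t in (0.1, 0.5, 1.0):
    print(f"t = {t:.1f}: closed form = {hazard_closed_form(t, lam):.6f}, "
          f"numeric = {hazard_numeric(t, lam):.6f}")
```

The system hazard starts at zero and rises with time, in contrast to the constant hazard of each individual element.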


3.4.3 Mean Time to Failure 

In the last section, it was shown that reliability calculations become very
complicated in a large system when there are many components and a diverse
reliability structure. Not only was the reliability expression difficult to
write down in such a case, but computation was lengthy, and interpretation of
the individual component contributions was not easy. One method of simplifying
the situation is to ask for less detailed information about the system. A
useful figure of merit for a system is the mean time to failure (MTTF).

As was derived in Eq. (B51) of Appendix B, the MTTF is the expected value
of the time to failure. The standard formula for the expected value involves
the integral of $t f(t)$; however, this can be expressed in terms of the
reliability function.

$$\text{MTTF} = \int_0^{\infty} R(t)\, dt \tag{3.24}$$

We can use this expression to compute the MTTF for various configurations.
For a series reliability configuration of n elements in which each of the
elements has a failure rate $z_i(t)$ and $Z_i(t) = \int_0^t z_i(\xi)\, d\xi$, one
can write the reliability expression as

$$R(t) = \exp\left[-\sum_{i=1}^{n} Z_i(t)\right] \tag{3.25a}$$

and the MTTF is given by

$$\text{MTTF} = \int_0^{\infty} \exp\left[-\sum_{i=1}^{n} Z_i(t)\right] dt \tag{3.25b}$$

If the series system has components with more than one type of hazard
model, the integral in Eq. (3.25b) is difficult to evaluate in closed form but can
always be done using a series approximation for the exponential integrand; see
Shooman [1990, p. 20].

Different equations hold for a parallel system. For two parallel elements,
the reliability expression is written as
$R(t) = e^{-Z_1(t)} + e^{-Z_2(t)} - e^{-[Z_1(t) + Z_2(t)]}$. If both system
components have a constant-hazard rate, and we apply Eq. (3.24) to each term in
the reliability expression,

$$\text{MTTF} = \frac{1}{\lambda_1} + \frac{1}{\lambda_2} - \frac{1}{\lambda_1 + \lambda_2} \tag{3.26}$$

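In the general case of n parallel elements with constant-hazard rate, the
expression becomes

$$\begin{aligned}
\text{MTTF} &= \left(\frac{1}{\lambda_1} + \frac{1}{\lambda_2} + \cdots + \frac{1}{\lambda_n}\right)
 - \left(\frac{1}{\lambda_1 + \lambda_2} + \frac{1}{\lambda_1 + \lambda_3} + \cdots + \frac{1}{\lambda_i + \lambda_j}\right) \\
&\quad + \left(\frac{1}{\lambda_1 + \lambda_2 + \lambda_3} + \frac{1}{\lambda_1 + \lambda_2 + \lambda_4} + \cdots + \frac{1}{\lambda_i + \lambda_j + \lambda_k}\right)
 - \cdots + (-1)^{n+1} \frac{1}{\sum_{i=1}^{n} \lambda_i}
\end{aligned} \tag{3.27}$$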


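If the n units are identical (that is, $\lambda_1 = \lambda_2 = \cdots = \lambda_n = \lambda$), then Eq. (3.27)
becomes

$$\text{MTTF} = \frac{1}{\lambda}\left[\frac{\binom{n}{1}}{1} - \frac{\binom{n}{2}}{2} + \frac{\binom{n}{3}}{3} - \cdots + (-1)^{n+1} \frac{\binom{n}{n}}{n}\right] = \frac{1}{\lambda} \sum_{i=1}^{n} \frac{1}{i} \tag{3.28a}$$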

The preceding series is called the harmonic series; the summation form is
given in Jolley [1961, p. 26, Eq. (200)] or Courant [1951, p. 380]. This series
occurs in number theory, and a series expansion is attributed to the famous
mathematician Euler; the constant in the expansion (0.577) is called Euler’s
constant [Jolley, 1961, p. 14, Eq. (70)].

$$\frac{1}{\lambda} \sum_{i=1}^{n} \frac{1}{i} = \frac{1}{\lambda}\left[0.577 + \ln n + \frac{1}{2n} - \frac{1}{12n(n+1)} - \cdots\right] \tag{3.28b}$$
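The parallel-system MTTF expressions can be compared numerically. The sketch below assumes constant hazards throughout and uses the inclusion-exclusion structure of Eq. (3.27); the function names and the sample failure rates are illustrative assumptions.

```python
import math
from itertools import combinations


def mttf_parallel(rates):
    """Eq. (3.27): alternating sum over all subsets of the failure rates."""
    total = 0.0
    for k in range(1, len(rates) + 1):
        sign = (-1) ** (k + 1)
        total += sign * sum(1.0 / sum(subset) for subset in combinations(rates, k))
    return total


def mttf_identical(lam, n):
    """Eq. (3.28a): harmonic sum divided by the common failure rate."""
    return sum(1.0 / i for i in range(1, n + 1)) / lam


def mttf_euler(lam, n):
    """Eq. (3.28b): first terms of the expansion with Euler's constant 0.577."""
    return (0.577 + math.log(n) + 1 / (2 * n) - 1 / (12 * n * (n + 1))) / lam


lam, n = 0.5, 4
print(mttf_parallel([1.0, 2.0]))     # two unequal units, matches Eq. (3.26)
print(mttf_parallel([lam] * n))      # general formula with identical rates
print(mttf_identical(lam, n))        # harmonic-series form
print(mttf_euler(lam, n))            # approximate form
```

For identical units the general formula and the harmonic sum agree, and the truncated expansion is close even for small n.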
Figure 3.8 Parallel reliability configuration of n elements and a coupling device $x_c$.

3.5 PARALLEL REDUNDANCY 
3.5.1 Independent Failures 

One classical approach to improving reliability is to provide a number of
elements in the system, any one of which can perform the necessary function. If
a system of n elements can function properly when only one of the elements is
good, a parallel configuration is indicated. (A parallel configuration of n items
is shown in Fig. 3.8.) The reliability expression for a parallel system may be
expressed in terms of the probability of success of each component or, more
conveniently, in terms of the probability of failure (coupling devices ignored).

$$R(t) = P(x_1 + x_2 + \cdots + x_n) = 1 - P(\bar{x}_1 \bar{x}_2 \cdots \bar{x}_n) \tag{3.29}$$

In the case of constant-hazard components, $P_f = P(\bar{x}_i) = 1 - e^{-\lambda_i t}$, and Eq.
(3.29) becomes

$$R(t) = 1 - \prod_{i=1}^{n} \left(1 - e^{-\lambda_i t}\right) \tag{3.30}$$

In the case of linearly increasing hazard, the expression becomes

$$R(t) = 1 - \prod_{i=1}^{n} \left(1 - e^{-K_i t^2/2}\right) \tag{3.31}$$

We recall that in the example of Fig. 3.6(a), we introduced the notion that
a coupling device is needed. Thus, in the general case, the system reliability
function is

$$R(t) = \left[1 - \prod_{i=1}^{n} P(\bar{x}_i)\right] P(x_c) \tag{3.32}$$

If we have IIU with constant-failure rates, then Eq. (3.32) becomes

$$R(t) = \left[1 - (1 - e^{-\lambda t})^n\right] e^{-\lambda_c t} \tag{3.33a}$$

where $\lambda$ is the element failure rate and $\lambda_c$ is the coupler failure rate. Assuming
$\lambda_c t < \lambda t \ll 1$, we can simplify Eq. (3.33a) by approximating $e^{-\lambda_c t}$ and
$e^{-\lambda t}$ by the first two terms in the expansion (cf. Eq. (3.17)), yielding
$(1 - e^{-\lambda t}) \approx \lambda t$ and $e^{-\lambda_c t} \approx 1 - \lambda_c t$. Substituting these approximations
into Eq. (3.33a),

$$R(t) \approx \left[1 - (\lambda t)^n\right](1 - \lambda_c t) \tag{3.33b}$$

Neglecting the product term $\lambda_c t (\lambda t)^n$ in the expansion of Eq. (3.33b), we have

$$R(t) \approx 1 - \lambda_c t - (\lambda t)^n \tag{3.34}$$

Clearly, the coupling term in Eq. (3.34) must be small or it becomes the
dominant portion of the probability of failure. We can obtain an “upper limit”
for $\lambda_c$ if we equate the second and third terms in Eq. (3.34) (the probabilities
of coupler failure and parallel system failure), yielding

$$\frac{\lambda_c}{\lambda} < (\lambda t)^{n-1} \tag{3.35}$$

For the case of n = 3 and a comparison at $\lambda t = 0.1$, we see that $\lambda_c/\lambda < 0.01$.
Thus the failure rate of the coupling device must be less than 1/100 that of the
element. In this example, if $\lambda_c = 0.01\lambda$, then the coupling system probability of
failure is equal to the parallel system probability of failure. This is a limiting
factor in the application of parallel reliability and is, unfortunately, sometimes
neglected in design and analysis. In many practical cases, the reliability of
the several elements in parallel is so close to unity that the reliability of the
coupling element dominates.
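The effect of the coupler can be explored numerically. This is a minimal sketch assuming identical constant-hazard elements, Eqs. (3.33a), (3.34), and (3.35), and normalized time ($\lambda = 1$); the function names are hypothetical.

```python
import math


def parallel_with_coupler(t, lam, lam_c, n):
    """Eq. (3.33a): n identical parallel elements in series with a coupler."""
    return (1 - (1 - math.exp(-lam * t)) ** n) * math.exp(-lam_c * t)


def parallel_with_coupler_approx(t, lam, lam_c, n):
    """Eq. (3.34): small-time approximation."""
    return 1 - lam_c * t - (lam * t) ** n


def coupler_rate_ratio_limit(lam_t, n):
    """Eq. (3.35): upper limit on lam_c / lam at a given value of lam * t."""
    return lam_t ** (n - 1)


lam, t, n = 1.0, 0.1, 3
limit = coupler_rate_ratio_limit(lam * t, n)
lam_c = limit * lam

exact = parallel_with_coupler(t, lam, lam_c, n)
approx = parallel_with_coupler_approx(t, lam, lam_c, n)
print(f"limit on lam_c/lam = {limit:.4f}")
print(f"exact R = {exact:.6f}, approximate R = {approx:.6f}")
```

At the limit, the coupler and the parallel elements contribute equal failure probabilities in the approximation of Eq. (3.34).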

If we examine Eq. (3.34) and assume that $\lambda_c = 0$, we see that the number
of parallel elements n affects the curvature of R(t) versus t. In general, the
more parallelism in a reliability block diagram, the less the initial slope of
the reliability curve. The converse is true with more series elements. As an
example, compare the reliability functions for the three reliability graphs in
Fig. 3.9 that are plotted in Fig. 3.10.



Figure 3.9 Three reliability structures: (a) single element, (b) two series elements, 
and (c) two parallel elements. 


Figure 3.10 Comparison of reliability functions: (a) constant-hazard elements and
(b) linearly increasing hazard elements. (Horizontal axis: normalized time.)


3.5.2 Dependent and Common Mode Effects 

There are two additional effects that must be discussed in analyzing a parallel
system: that of common mode (common cause) failures and that of dependent
failures. A common mode failure is one that affects all the elements in a
redundant system. The term was popularized when the first reliability and risk
analyses of nuclear reactors were performed in the 1970s [McCormick, 1981,
Chapter 12]. To protect against core melt, reactors have two emergency
core-cooling systems. One important failure scenario, that of an earthquake, is
likely to rupture the piping on both cooling systems.

Another example of common mode activity occurred early in the space program.
During the reentry of a Gemini spacecraft, one of the two guidance computers
failed, and a few minutes later the second computer failed. Fortunately,
the astronauts had an additional backup procedure. Based on rehearsed
procedures and precomputations, the Ground Control advised the astronauts to
maneuver the spacecraft, to align the horizon with one of a set of horizontal
scribe marks on the windows, and to rotate the spacecraft so that the Sun was
aligned with one set of vertical scribe marks. The Ground Control then gave
the astronauts a countdown to retro-rocket ignition and a second countdown
to rocket cutoff. The spacecraft splashed into the ocean closer to the recovery
ship than in any previous computer-controlled reentry. Subsequent analysis
showed that the temperature inside the two computers was much higher than
expected and that the diodes in the separate power supply of each computer
had burned out. From this example, we learn several lessons:

1. The designers provided two computers for redundancy.

2. Correctly, two separate power supplies were provided, one for each
computer, to avoid a common power-supply failure mode.

3. An unexpectedly high ambient temperature caused identical failures in the
diodes, resulting in a common mode failure.

4. Fortunately, there was a third redundant mode that depended on a
completely different mechanism, the scribe marks, and visual alignment.
When parallel elements are purposely chosen to involve devices with
different failure mechanisms to avoid common mode failures, the term
diversity is used.

In terms of analysis, common mode failures behave much like failures of
a coupling mechanism that was studied previously. In fact, we can use Eq.
(3.33) to analyze the effect if we use $\lambda_c$ to represent the sum of coupling and
common mode failure rates. (A fortuitous choice of subscript!)

Another effect to consider in parallel systems is the effect of dependent
failures. Suppose we wish to use two parallel satellite channels for reliable
communication, and the probability of each channel failure is 0.01. For a single
channel, the reliability would be 0.99; for two parallel channels, $c_1$ and $c_2$, we
would have

$$R = P(c_1 + c_2) = 1 - P(\bar{c}_1 \bar{c}_2) \tag{3.36}$$

Expanding the last term in Eq. (3.36) yields

$$R = 1 - P(\bar{c}_1 \bar{c}_2) = 1 - P(\bar{c}_1)\,P(\bar{c}_2 \mid \bar{c}_1) \tag{3.37}$$

If the failures of both channels, $c_1$ and $c_2$, are independent, Eq. (3.37) yields
R = 1 − 0.01 × 0.01 = 0.9999. However, suppose that one-quarter of satellite
transmission failures are due to atmospheric interference that would affect
both channels. In this case, $P(\bar{c}_2 \mid \bar{c}_1)$ is 0.25, and Eq. (3.37) yields
R = 1 − 0.01 × 0.25 = 0.9975. Thus for a single channel, the probability of
failure is 0.01; with two independent parallel channels, it is 0.0001, but for
dependent channels, it is 0.0025. This means that dependency has reduced the
expected 100-fold reduction in failure probabilities to a reduction by only a
factor of 4. In general, a modeling of dependent failures requires some
knowledge of the failure mechanisms that result in dependent modes.
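The arithmetic of the two-channel example follows directly from Eq. (3.37). The sketch below uses the values quoted in the text; the function name `two_channel_reliability` is a hypothetical helper.

```python
def two_channel_reliability(p_fail: float, p_second_given_first: float) -> float:
    """Eq. (3.37): R = 1 - P(first fails) * P(second fails | first fails)."""
    return 1 - p_fail * p_second_given_first


p_fail = 0.01
independent = two_channel_reliability(p_fail, 0.01)   # failures independent
dependent = two_channel_reliability(p_fail, 0.25)     # one-quarter common cause

# Factor by which the failure probability drops relative to a single channel.
gain_independent = p_fail / (1 - independent)
gain_dependent = p_fail / (1 - dependent)

print(f"independent: R = {independent:.4f}, improvement = {gain_independent:.0f}x")
print(f"dependent:   R = {dependent:.4f}, improvement = {gain_dependent:.0f}x")
```

Varying the conditional probability shows how quickly the benefit of redundancy erodes as the dependence grows.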

The above analysis has explored many factors that must be considered 
in analyzing parallel systems: coupling failures, common mode failures, and 
dependent failures. Clearly, only simple models were used in each case. More 
complex models may be formulated by using Markov process models—to be 
discussed in Section 3.7, where we analyze standby redundancy. 


3.6 AN r-OUT-OF-n STRUCTURE

Another simple structure that serves as a useful model for many reliability
problems is an r-out-of-n structure. Such a model represents a system of n
components in which r of the n items must be good for the system to succeed.
(Of course, r is less than n.) An example of an r-out-of-n structure is a
fiber-optic cable, which has a capacity of n circuits. If the application
requires r channels of the transmission, this is an r-out-of-n system (r : n).
If the capacity of the cable n exceeds r by a significant amount, this
represents a form of parallel redundancy. We are of course assuming that if a
circuit fails it can be switched to one of the n − r “extra circuits.”

We may formulate a structural model for an r-out-of-n system, but it is
simpler to use the binomial distribution if applicable. The binomial distribution
can be used only when the n components are independent and identical. If the
components differ or are dependent, the structural-model approach must be
used. Success of exactly r-out-of-n identical and independent items is given
by

$$B(r : n) = \binom{n}{r} p^r (1 - p)^{n-r} \tag{3.38}$$

where r : n stands for r out of n, and the success of at least r-out-of-n items is
given by

$$P_s = \sum_{k=r}^{n} B(k : n) \tag{3.39}$$

For constant-hazard components, Eq. (3.39) becomes

$$R(t) = \sum_{k=r}^{n} \binom{n}{k} e^{-k\lambda t} \left(1 - e^{-\lambda t}\right)^{n-k} \tag{3.40}$$

Similarly, for linearly increasing or Weibull components, the reliability
functions are

$$R(t) = \sum_{k=r}^{n} \binom{n}{k} e^{-kKt^2/2} \left(1 - e^{-Kt^2/2}\right)^{n-k} \tag{3.41a}$$

and

$$R(t) = \sum_{k=r}^{n} \binom{n}{k} e^{-kKt^{m+1}/(m+1)} \left(1 - e^{-Kt^{m+1}/(m+1)}\right)^{n-k} \tag{3.41b}$$

Clearly, Eqs. (3.39)-(3.41) can be studied and evaluated by a parametric
computer study. In many cases, it is useful to approximate the result, although
numerical evaluation via a computer program is not difficult. For an r-out-of-n
structure of identical components, the exact reliability expression is given by
Eqs. (3.38) and (3.39). As is well known, we can approximate the binomial
distribution by the Poisson or normal distributions, depending on the values of
n and p (see Shooman, 1990, Sections 2.5.6 and 2.6.8). Interestingly, we can also
develop similar approximations for the case in which the n parameters are not
identical.

The Poisson approximation to the binomial holds for p < 0.05 and n > 20,
which represents the low-reliability region. If we are interested in the
high-reliability region, we switch to failure probabilities, requiring
q = 1 − p < 0.05 and n > 20. Since we are assuming different components, we
define average probabilities of success and failure $\bar{p}$ and $\bar{q}$ as

$$\bar{p} = \frac{1}{n} \sum_{i=1}^{n} p_i = 1 - \bar{q} = 1 - \frac{1}{n} \sum_{i=1}^{n} (1 - p_i) \tag{3.42}$$

Thus, for the high-reliability region, we compute the probability of n − r or
fewer failures as

$$R(t) = \sum_{k=0}^{n-r} \frac{(n\bar{q})^k e^{-n\bar{q}}}{k!} \tag{3.43}$$

and for the low-reliability region, we compute the probability of r or more
successes as

$$R(t) = \sum_{k=r}^{n} \frac{(n\bar{p})^k e^{-n\bar{p}}}{k!} \tag{3.44}$$


Equations (3.43) and (3.44) avoid a great deal of algebra in dealing with
nonidentical r-out-of-n components. The question of accuracy is somewhat
difficult to answer since it depends on the system structure and the range of
values of $p_i$ that make up $\bar{p}$. For example, if the values of $q_i$ vary only over
a 2 : 1 range, and if $\bar{q} < 0.05$ and n > 20, intuition tells us that we should
obtain reasonably accurate results. Clearly, modern computer power makes
explicit enumeration of Eqs. (3.39)-(3.41) a simple procedure, and Eqs. (3.43)
and (3.44) are useful mainly as simplified analytical expressions that provide a
check on computations. [Note that Eqs. (3.43) and (3.44) also hold true for IIU
with $\bar{p} = p$ and $\bar{q} = q$.]

We can appreciate the power of an r : n design by considering the following
example. Suppose we have a fiber-optic cable with 20 channels (strands) and a
system that requires all 20 channels for success. (For simplicity of the
discussion, assume that the associated electronics will not fail.) Suppose the
probability of failure of each channel within the cable is q = 0.0005 and
p = 0.9995. Since all 20 channels are needed for success, the reliability of a
20-channel cable will be $R_{20} = (0.9995)^{20} = 0.990047$. Another option is to use
two parallel 20-channel cables and have the associated electronics switch from
cable A to cable B whenever there is any failure in cable A. The reliability of
such an ordinary parallel system of two 20-channel cables is given by
$R_{2/20} = 2(0.990047) - (0.990047)^2 = 0.9999009$. Another design option is to
include extra channels in the single cable beyond the 20 that are needed; in
such a case, we have an r : n system. Suppose we approach the design in a
trial-and-error fashion. We begin by trying n = 21 channels, in which case we
have

$$\begin{aligned}
R_{21} &= B(21 : 21) + B(20 : 21) = p^{21} q^0 + 21 p^{20} q \\
&= (0.9995)^{21} + 21(0.9995)^{20}(0.0005) = 0.98955223 + 0.010395497 \\
&= 0.999947831
\end{aligned} \tag{3.45}$$

Thus $R_{21}$ exceeds the design with two 20-channel cables. Clearly, all the
designs require some electronic steering (couplers) for the choice of channels,
and the coupler reliability should be included in a detailed comparison. Of
course, one should worry about common mode failures, which could completely
change the foregoing results. Construction damage (that is, line-severing by a
contractor’s excavating machine, such as a backhoe) is a significant failure
mode for in-soil fiber-optic telephone lines.

As a check on Eq. (3.45), we compute the approximation Eq. (3.43) for n
= 21, r = 20.

$$R \approx \sum_{k=0}^{1} \frac{(nq)^k e^{-nq}}{k!} = (1 + nq)\,e^{-nq} = [1 + 21(0.0005)]\,e^{-21 \times 0.0005} = 0.999945 \tag{3.46}$$

These values are summarized in Table 3.1.

TABLE 3.1 Comparison of Design for Fiber-Optic Cable Example

System                               Reliability, R   Unreliability, (1 − R)
Single 20-channel cable              0.990047         0.00995
Two 20-channel cables in parallel    0.9999009        0.000099
A 21-channel cable (exact)           0.999948         0.000052
A 21-channel cable (approx.)         0.999945         0.000055

Essentially, the r : n system is efficient because the redundancy is
applied at a lower level. In practice, a 24- or 25-channel cable would probably
be used, since a large portion of the cable cost would arise from the land used
and the laying of the cable. Therefore, the increased cost of including four or
five extra channels would be “money well spent,” since several channels could
fail and be locked out before the cable failed. If we were discussing the number
of channels in a satellite communications system, the major cost would be the
launch; the economics of including a few extra channels would be similar.
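The fiber-optic design comparison can be reproduced with a few lines of code. This is a sketch assuming independent, identical channels with q = 0.0005, the binomial sum of Eq. (3.39), and the Poisson form of Eq. (3.43); the function names are hypothetical and `math.comb` requires Python 3.8 or later.

```python
import math


def r_out_of_n(r: int, n: int, p: float) -> float:
    """Eq. (3.39): probability that at least r of n identical items succeed."""
    return sum(math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(r, n + 1))


def r_out_of_n_poisson(r: int, n: int, q: float) -> float:
    """Eq. (3.43): Poisson approximation, probability of n - r or fewer failures."""
    mean = n * q
    return sum(mean**k * math.exp(-mean) / math.factorial(k) for k in range(n - r + 1))


p, q = 0.9995, 0.0005
single = r_out_of_n(20, 20, p)            # one 20-channel cable, all channels needed
two_cables = 1 - (1 - single) ** 2        # two such cables in ordinary parallel
exact_21 = r_out_of_n(20, 21, p)          # 20-out-of-21, binomial
approx_21 = r_out_of_n_poisson(20, 21, q) # 20-out-of-21, Poisson approximation

for name, value in [("single 20-channel", single), ("two cables", two_cables),
                    ("21-channel exact", exact_21), ("21-channel approx.", approx_21)]:
    print(f"{name:20s} R = {value:.9f}  1 - R = {1 - value:.3e}")
```

Increasing n to 24 or 25 in `r_out_of_n(20, n, p)` shows how quickly the unreliability falls as spare channels are added.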


3.7 STANDBY SYSTEMS 
3.7.1 Introduction 

Suppose we consider two components, $x_1$ and $x_2$, in parallel. For discussion
purposes, we can think of $x_1$ as the primary system and $x_2$ as the backup;
however, the systems are identical and could be interchanged. In an ordinary
parallel system, both $x_1$ and $x_2$ begin operation at time t = 0, and both can fail.
If $t_1$ is the time to failure of $x_1$, and $t_2$ is the time to failure of $x_2$, then the time
to system failure is the maximum value of $(t_1, t_2)$. An improvement would be
to energize the primary system $x_1$ and have backup system $x_2$ unenergized so
that it cannot fail. Assume that we can immediately detect the failure of $x_1$ and
can energize $x_2$ so that it becomes the active element. Such a configuration is
called a standby system; $x_1$ is called the on-line system, and $x_2$ the standby
system. Sometimes an ordinary parallel system is called a “hot” standby, and
a standby system is called a “cold” standby. The time to system failure for
a standby system is given by $t = t_1 + t_2$. Clearly, $t_1 + t_2 > \max(t_1, t_2)$, and a
standby system is superior to a parallel system. The “coupler” element in a
standby system is more complex than in a parallel system, requiring a more
detailed analysis.

One can take a number of different approaches to deriving the equations for
a standby system. One is to determine the probability distribution of $t = t_1 + t_2$,
given the distributions of $t_1$ and $t_2$ [Papoulis, 1965, pp. 193-194]. Another
approach is to develop a more general system of probability equations known
as Markov models. This approach is developed in Appendix B and will be
used later in this chapter to describe repairable systems.

TABLE 3.2 States for a Parallel System

$s_0 = x_1 x_2$ = Both components good.
$s_1 = x_1 \bar{x}_2$ = $x_1$, good; $x_2$, failed.
$s_2 = \bar{x}_1 x_2$ = $x_1$, failed; $x_2$, good.
$s_3 = \bar{x}_1 \bar{x}_2$ = Both components failed.

In the next section, we take a slightly simpler approach: we develop two 
difference equations, solve them, and by means of a limiting process develop 
the needed probabilities. In reality, we are developing a simplified Markov 
model without going through some of the formalism. 

3.7.2 Success Probabilities for a Standby System 

One can characterize an ordinary parallel system with components $x_1$ and $x_2$ by
the four states given in Table 3.2. If we assume that the standby component in
a standby system won’t fail until energized, then the three states given in Table
3.3 describe the system. The probability that element x fails in time interval $\Delta t$
is given by the product of the failure rate $\lambda$ (failures per hour) and $\Delta t$. Similarly,
the probability of no failure in this interval is $(1 - \lambda \Delta t)$. We can summarize
this information by the probabilistic state model (probabilistic graph, Markov
model) shown in Fig. 3.11.

The probability that the system makes a transition from state $s_0$ to state $s_1$ in
time $\Delta t$ is given by $\lambda_1 \Delta t$, and the transition probability for staying in state $s_0$ is
$(1 - \lambda_1 \Delta t)$. Similar expressions are shown in the figure for staying in state $s_1$ or
making a transition to state $s_2$. The probabilities of being in the various system
states at time $t + \Delta t$ are governed by the following difference equations:

$$P_{s_0}(t + \Delta t) = (1 - \lambda_1 \Delta t)\,P_{s_0}(t) \tag{3.47a}$$

$$P_{s_1}(t + \Delta t) = \lambda_1 \Delta t\,P_{s_0}(t) + (1 - \lambda_2 \Delta t)\,P_{s_1}(t) \tag{3.47b}$$

$$P_{s_2}(t + \Delta t) = \lambda_2 \Delta t\,P_{s_1}(t) + (1)\,P_{s_2}(t) \tag{3.47c}$$

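TABLE 3.3 States for a Standby System

$s_0 = x_1 x_2$ = On-line and standby components good.
$s_1 = \bar{x}_1 x_2$ = On-line failed and standby component good.
$s_2 = \bar{x}_1 \bar{x}_2$ = On-line and standby components failed.

Figure 3.11 Probabilistic state model (Markov model) of the standby system; the
self-transition probabilities of states $s_0$, $s_1$, and $s_2$ are $1 - \lambda_1 \Delta t$,
$1 - \lambda_2 \Delta t$, and 1.

We can rewrite Eq. (3.47a) as

$$P_{s_0}(t + \Delta t) - P_{s_0}(t) = -\lambda_1 \Delta t\,P_{s_0}(t) \tag{3.48a}$$

$$\frac{P_{s_0}(t + \Delta t) - P_{s_0}(t)}{\Delta t} = -\lambda_1 P_{s_0}(t) \tag{3.48b}$$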


Taking the limit of the left-hand side of Eq. (3.48b) as $\Delta t \to 0$ yields the time
derivative, and the equation becomes

$$\frac{dP_{s_0}(t)}{dt} + \lambda_1 P_{s_0}(t) = 0 \tag{3.49}$$

This is a linear, first-order, homogeneous differential equation and is known to
have the solution $P_{s_0} = Ae^{-\lambda_1 t}$. To verify that this is a solution, we substitute
into Eq. (3.49) and obtain

$$-\lambda_1 A e^{-\lambda_1 t} + \lambda_1 A e^{-\lambda_1 t} = 0$$

The value of A is determined from the initial condition. If we start with a good
system, $P_{s_0}(t = 0) = 1$; thus A = 1 and

$$P_{s_0}(t) = e^{-\lambda_1 t} \tag{3.50}$$

In a similar manner, we can rewrite Eq. (3.47b) and take the limit, obtaining

$$\frac{dP_{s_1}(t)}{dt} + \lambda_2 P_{s_1}(t) = \lambda_1 P_{s_0}(t) \tag{3.51}$$

This equation has the solution

$$P_{s_1}(t) = B_1 e^{-\lambda_1 t} + B_2 e^{-\lambda_2 t} \tag{3.52}$$

Substitution of Eq. (3.52) into Eq. (3.51) yields a group of exponential terms
that reduces to

$$[\lambda_2 B_1 - \lambda_1 B_1 - \lambda_1]\,e^{-\lambda_1 t} = 0 \tag{3.53}$$

and solving for $B_1$ yields

$$B_1 = \frac{\lambda_1}{\lambda_2 - \lambda_1} \tag{3.54}$$

We can obtain the other constant by substituting the initial condition
$P_{s_1}(t = 0) = 0$, and solving for $B_2$ yields

$$B_2 = -B_1 = -\frac{\lambda_1}{\lambda_2 - \lambda_1} \tag{3.55}$$

The complete solution is

$$P_{s_1}(t) = \frac{\lambda_1}{\lambda_2 - \lambda_1}\left(e^{-\lambda_1 t} - e^{-\lambda_2 t}\right) \tag{3.56}$$

Note that the system is successful if we are in state 0 or state 1 (state 2 is
a failure). Thus the reliability is given by

$$R(t) = P_{s_0}(t) + P_{s_1}(t) \tag{3.57}$$

Equation (3.57) yields the reliability expression for a standby system where
the on-line and the standby components have two different failure rates. In the
more common case, both the on-line and standby components have the same
failure rate, and we have a small difficulty since Eq. (3.56) becomes 0/0. The
standard approach in such cases is to use L'Hôpital's rule from calculus. The
procedure is to take the derivative of the numerator and the denominator
separately with respect to $\lambda_2$, then to take the limit as $\lambda_2 \to \lambda_1$. This results in
the expression for the reliability of a standby system with two identical on-line
and standby components:

$$R(t) = e^{-\lambda t} + \lambda t\, e^{-\lambda t} \tag{3.58}$$
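As a worked sketch of the limiting step (assuming the derivative is taken with respect to $\lambda_2$ while $\lambda_1$ is held fixed), the second term of Eq. (3.57) can be evaluated as follows:

```latex
\lim_{\lambda_2 \to \lambda_1} P_{s_1}(t)
  = \lim_{\lambda_2 \to \lambda_1}
    \frac{\lambda_1 \left( e^{-\lambda_1 t} - e^{-\lambda_2 t} \right)}{\lambda_2 - \lambda_1}
  = \lim_{\lambda_2 \to \lambda_1}
    \frac{\lambda_1 t \, e^{-\lambda_2 t}}{1}
  = \lambda_1 t \, e^{-\lambda_1 t}
```

Adding $P_{s_0}(t) = e^{-\lambda_1 t}$ and writing $\lambda_1 = \lambda$ gives the two terms of Eq. (3.58).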

A few general comments are appropriate at this point. 

1. The solution given in Eq. (3.58) can be recognized as the first two terms
in the Poisson distribution, the probability of zero occurrences in time
t plus the probability of one occurrence in time t hours, where $\lambda$ is the
occurrence rate per hour. Since the “exposure time” for the standby
component does not start until the on-line element has failed, the occurrences
are a sequence in time that follows the Poisson distribution.

2. The model in Fig. 3.11 could have been extended to the right to
incorporate a very large number of components and states. The general solution
of such a model would have yielded the Poisson distribution.

3. A model could have been constructed composed of four states: ($x_1 x_2$,
$x_1 \bar{x}_2$, $\bar{x}_1 x_2$, $\bar{x}_1 \bar{x}_2$). Solution of this model would yield the probability
expressions for a parallel system. However, solution of a parallel system
via a Markov model is seldom done except for tutorial purposes because
the direct methods of Section 3.5 are simpler.

4. Generalization of a probabilistic graph, the resulting differential
equations, the solution process, and the summing of appropriate probabilities
leads to a generalized Markov model. This is further illustrated in the
next section on repair.

5. In Section 3.8.2 and Chapter 4, we study the formulation of Markov 
models using a more general algorithm to derive the equations, and we 
use Laplace transforms to solve the equations. 
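The difference equations (3.47) can also be stepped forward directly on a computer and compared with the closed-form results of Eqs. (3.50) and (3.56). The following is a minimal sketch assuming illustrative failure rates and a small fixed time step; the function names are hypothetical.

```python
import math


def simulate_standby(lam1: float, lam2: float, t_end: float, dt: float):
    """Step Eqs. (3.47a)-(3.47c) from t = 0 to t_end, starting in state s0."""
    p0, p1, p2 = 1.0, 0.0, 0.0
    for _ in range(round(t_end / dt)):
        p0, p1, p2 = (
            (1 - lam1 * dt) * p0,                    # Eq. (3.47a)
            lam1 * dt * p0 + (1 - lam2 * dt) * p1,   # Eq. (3.47b)
            lam2 * dt * p1 + p2,                     # Eq. (3.47c)
        )
    return p0, p1, p2


def standby_closed_form(lam1: float, lam2: float, t: float):
    """Eqs. (3.50) and (3.56), valid for lam1 != lam2."""
    p0 = math.exp(-lam1 * t)
    p1 = lam1 / (lam2 - lam1) * (math.exp(-lam1 * t) - math.exp(-lam2 * t))
    return p0, p1


lam1, lam2, t_end = 1.0, 2.0, 1.0
p0_num, p1_num, p2_num = simulate_standby(lam1, lam2, t_end, dt=1e-4)
p0, p1 = standby_closed_form(lam1, lam2, t_end)

print(f"P_s0: numeric = {p0_num:.5f}, closed form = {p0:.5f}")
print(f"P_s1: numeric = {p1_num:.5f}, closed form = {p1:.5f}")
print(f"R(t) = P_s0 + P_s1 = {p0 + p1:.5f}")
```

Shrinking `dt` brings the stepped probabilities closer to the closed-form values, mirroring the limiting process used in the derivation.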


3.7.3 Comparison of Parallel and Standby Systems 

It is assumed that the reader has studied the material in Sections A8 and B6
that cover Markov models. We now compare the reliability of parallel and
standby systems. Standby systems are inherently superior to parallel systems;
however, much of this superiority depends on the reliability of the standby
switch. The reliability of the coupler in a parallel system must also be
considered in the comparison. The reliability of the standby system
with an imperfect switch will require a more complex Markov model than
that developed in the previous section, and such a model is discussed below.

The switch in a standby system must perform three functions: 

1. It must have some sort of decision element or algorithm that is capable 
of sensing improper operation. 

2. The switch must then remove the signal input from the on-line unit and
apply it to the standby unit, and it must switch the output as well.

3. If the element is an active one, the power must be transferred from the 
on-line to the standby element (see Fig. 3.12). In some cases, the input 
and output signals can be permanently connected to the two elements; 
only the power needs to be switched. 

Often the decision unit and the input (and output) switch can be incorporated
into one unit: either an analog circuit or a digital logic circuit or processor
algorithm. Generally, the power switch would be some sort of relay or
electronic switch, or it could be a mechanical device in the case of a mechanical,
hydraulic, or pneumatic system. The specific implementation will vary with
the application and the ingenuity of the designer.

The reliability expression for a two-element standby system with constant 
hazards and a perfect switch was given in Eqs. (3.50), (3.56), and (3.57) and 
for identical elements in Eq. (3.58). We now introduce the possibility that the 
switch is imperfect. 


Figure 3.12 A standby system in which input and power switching are shown.


We begin with a simple model for the switch where we assume that any
failure of the switch is a failure of the system, even in the case where both the
on-line and the standby components are good. This is a conservative model that
is easy to formulate. If we assume that the switch failures are independent of
the on-line and standby component failures and that the switch has a constant
failure rate $\lambda_s$, then Eq. (3.58) still holds for the components and is multiplied
by the switch reliability $e^{-\lambda_s t}$. Thus we obtain

$$R_1(t) = e^{-\lambda_s t}\left(e^{-\lambda t} + \lambda t\, e^{-\lambda t}\right) \tag{3.59}$$

Clearly, the switch reliability multiplies the reliability of the standby system
and degrades the system reliability. We can evaluate how significant the
switch reliability problem is by comparing it with an ordinary parallel system.
A comparison of Eqs. (3.59) and (3.30) (for n = 2 and identical failure rates)
is given in Fig. 3.13. Note that when the switch failure rate is only 10% of the
component failure rates ($\lambda_s = 0.1\lambda$), the degradation is only minor, especially
in the high-reliability region of most interest: (1 > R(t) > 0.9). The standby
system degrades to about the same reliability as the parallel system when the
switch failure rate is about half the component failure rate.

A simple way to improve the switch reliability model is to assume that the 
switch failure mode is such that it only fails to switch from on-line to standby 
when the on-line element fails (it never switches erroneously when the on-line 
element is good). In such a case, the state with no failures is a good state,
and the state with one element failure and no switch failure is also a good
state; that is, the switch reliability multiplies only the second term in
Eq. (3.58). In such a case, the reliability expression becomes

$$R_2(t) = e^{-\lambda t} + \lambda t\, e^{-\lambda t} e^{-\lambda_s t} \tag{3.60}$$

Figure 3.13 A comparison of a two-element ordinary parallel system with a two-
element standby system with imperfect switch reliability.

Clearly, this is a less conservative and more realistic switch model than the
previous one.

One can construct even more complex failure models for the switch in a 
standby system [Shooman, 1990, Section 6.9]. 

1. Switch failure modes where the switching occurs even when the on-line 
element is good or where the switch jitters between elements can be 
included. 

2. The failure rate of n nonidentical standby elements was first derived by
Bazovsky [1961, p. 117]; this can be shown to be related to the gamma
distribution and to approach the normal distribution for large n [Shooman,
1990].

3. For n identical standby elements, the system succeeds if there are n − 1 or
fewer failures, and the probabilities are given by the Poisson distribution,
which leads to the expression

$$R(t) = e^{-\lambda t} \sum_{i=0}^{n-1} \frac{(\lambda t)^i}{i!} \tag{3.61}$$
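
As a rough numerical companion to Eqs. (3.59) through (3.61), the following Python sketch evaluates the two switch models next to an ordinary two-element parallel system. The function names and the sample values ($\lambda = 1$, $\lambda_s = 0.1$) are illustrative assumptions, not taken from the text, and the parallel expression $2e^{-\lambda t} - e^{-2\lambda t}$ is the standard result that Eq. (3.30) is assumed to give for n = 2 identical elements.

```python
import math


def r_standby_switch_series(lam, lam_s, t):
    """Eq. (3.59): any switch failure is treated as a system failure."""
    return math.exp(-lam_s * t) * (math.exp(-lam * t) + lam * t * math.exp(-lam * t))


def r_standby_switch_on_demand(lam, lam_s, t):
    """Eq. (3.60): the switch matters only when a changeover is needed."""
    return math.exp(-lam * t) + lam * t * math.exp(-lam * t) * math.exp(-lam_s * t)


def r_parallel_two(lam, t):
    """Two identical elements in ordinary parallel (assumed form of Eq. (3.30), n = 2)."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


def r_standby_n(lam, n, t):
    """Eq. (3.61): n identical standby elements with a perfect switch."""
    return math.exp(-lam * t) * sum((lam * t) ** i / math.factorial(i) for i in range(n))


if __name__ == "__main__":
    lam, lam_s = 1.0, 0.1  # hypothetical rates, chosen only for illustration
    for t in (0.1, 0.3, 0.5):
        print(t,
              round(r_parallel_two(lam, t), 4),
              round(r_standby_switch_series(lam, lam_s, t), 4),
              round(r_standby_switch_on_demand(lam, lam_s, t), 4))
```

Setting `lam_s` to 0 in either switch model recovers the perfect-switch case, which is also what `r_standby_n` returns for n = 2.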


3.8 REPAIRABLE SYSTEMS 
3.8.1 Introduction 

Repair or replacement can be viewed as the same process, that is, replacement 
of a failed component with a spare is just a fast repair. A complete description 
of the repair process takes into account several steps: (a) detection that a failure 
has occurred; (b) diagnosis or localization of the cause of the failure; (c) the 
delay for replacement or repair, which includes the logistic delay in waiting 
for a replacement component or part to arrive; and (d) test and/or recalibration 
of the system. In this section, we concentrate on modeling the basics of repair 
and will not decompose the repair process into a finer model that details all of 
these substates. 

The decomposition of a repair process into substates results in a nonconstant
repair rate (see Shooman [1990, pp. 348-350]). In fact, there is evidence
that some repair processes lead to lognormal repair distributions or other
nonconstant-repair distributions. One can show that a number of distributions
(e.g., lognormal, Weibull, gamma, Erlang) can be used to model a repair process
[Muth, 1967, Chapter 3]. Some software for modeling system availability
permits nonconstant failure and repair rates. Only in special cases are such
detailed data available, and constant repair rates are commonly used. In fact,
it is not clear how much difference there is in computing the steady-state
availability for constant and nonconstant repair rates [Shooman, 1990,
Eq. (6.106) ff.]. For a general discussion of repair modeling, see Ascher [1984].

In general, repair improves two different measures of system performance:
the reliability and the availability. We begin our discussion by considering a
single computer and the following two different types of computer systems:
an air traffic control system and a file server that provides electronic mail and
network access to a group of users. Since there is only a single system, a
failure of the computer represents a system failure, and repair will not affect
the system reliability function. The availability of the system is a measure of
how much of the operating time the system is up. In the case of the air traffic
control system, the fact that the system may occasionally be down for short
time periods while repair or replacement goes on may not be tolerable, whereas
in the case of the file server, a small amount of downtime may be acceptable.
Thus a computation of both the reliability and the availability of the system is
required; however, for some critical applications, the most important measure
is the reliability. If we say the basic system is composed of two computers in
parallel or standby, then the problem changes. In either case, the system can
tolerate one computer failure and stay up. It then becomes a race to see if the
failed element can be repaired and restored before the remaining element fails.
The system only goes down in the rare event that the second component fails
before the repair or replacement is completed.

In the following sections, we will model a two-element parallel and a two-
element standby system with repair and will comment on the improvements in
reliability and availability due to repair. To facilitate the solutions of the
ensuing Markov models, some simple features of the Laplace transform method will
be employed. It is assumed that the reader is familiar with Laplace transforms
or will have already read the brief introduction to Laplace transform methods
given in Appendix B, Section B8. We begin our discussion by developing a
general Markov model for two elements with repair.


3.8.2 Reliability of a Two-Element System with Repair 

The benefits of repair in improving system reliability are easy to illustrate in a
two-element system, which is the simplest system used in high-reliability fault-
tolerant situations. Repair improves both a hot standby and a cold standby
system. In fact, we can use the same Markov model to describe both situations if
we appropriately modify the transition probabilities. A Markov model for two
parallel or standby systems with repair is given in Fig. 3.14. The transition rate
from state $s_0$ to $s_1$ is given by $2\lambda$ in the case of an ordinary parallel system
because two elements are operating and either one can fail. In the case of a
standby system, the transition is given by $\lambda$ since only one component is
powered and only that one can fail (for this model, we ignore the possibility that
the standby system can fail). The transition rate from state $s_1$ to $s_0$ represents
the repair process. If only one repairman is present (the usual case), then this
transition is governed by the constant repair rate $\mu$. In a rare case, more than
one repairman will be present, and if all work cooperatively, the repair rate is
$> \mu$. In some circumstances, there will be only a shared repairman among a
number of equipments, in which case the repair rate is $< \mu$.

In many cases, study of the repair statistics shows a nonexponential
distribution (the exponential distribution is the one corresponding to a constant
transition rate), specifically, the lognormal distribution [Ascher, 1984; Shooman,
1990, pp. 348-350]. However, many of the benefits of repair are illustrated by
the constant transition rate repair model.

[Figure 3.14: Markov graph with states $s_0$, $s_1$, and $s_2$. The probabilities of
remaining in each state are $1 - \lambda'\Delta t$ for $s_0$, $1 - (\lambda + \mu')\Delta t$ for $s_1$, and 1 for
$s_2$, where $\lambda' = 2\lambda$ for an ordinary system, $\lambda' = \lambda$ for a standby system,
$\mu' = \mu$ for one repairman, and $\mu' = k\mu$ for more than one repairman ($k > 1$).]

Figure 3.14 A Markov reliability model for two identical parallel elements and k
repairmen.

The Markov equations corresponding to Fig. 3.14 can be written by utilizing a
simple algorithm:

1. The terms with 1 and $\Delta t$ in the Markov graph are deleted.

2. A first-order Markov differential equation is written for each node where 
the left-hand side of the equation is the first-order time derivative of the 
probability of being in that state at time t. 

3. The right-hand side of each equation is a sum of probability terms for 
each branch that enters the node in question. The coefficient of each 
probability term is the transition probability for the entering branch. 


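We will illustrate the use of these steps in formulating the Markov equations
for Fig. 3.14.

$$\frac{dP_{s_0}(t)}{dt} = -\lambda' P_{s_0}(t) + \mu' P_{s_1}(t) \tag{3.62a}$$

$$\frac{dP_{s_1}(t)}{dt} = \lambda' P_{s_0}(t) - (\lambda + \mu') P_{s_1}(t) \tag{3.62b}$$

$$\frac{dP_{s_2}(t)}{dt} = \lambda P_{s_1}(t) \tag{3.62c}$$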


Assuming that both systems are initially good, the initial conditions are 

$$P_{s_0}(0) = 1, \qquad P_{s_1}(0) = P_{s_2}(0) = 0$$

One great advantage of the Laplace transform method is that it deals simply 
with initial conditions. Another is that it transforms differential equations in the 
time domain into a set of algebraic equations in the Laplace transform domain 
(often called the frequency domain), which are written in terms of the Laplace 
operator $s$.

To transform the set of equations (3.62a-c) into the Laplace domain, we 
utilize transform theorem 2 (which incorporates initial conditions) from Table 
B7 of Appendix B, yielding 


$$sP_{s_0}(s) - 1 = -\lambda' P_{s_0}(s) + \mu' P_{s_1}(s) \tag{3.63a}$$

$$sP_{s_1}(s) - 0 = \lambda' P_{s_0}(s) - (\lambda + \mu') P_{s_1}(s) \tag{3.63b}$$

$$sP_{s_2}(s) - 0 = \lambda P_{s_1}(s) \tag{3.63c}$$

Writing these equations in a more symmetric form yields

$$(s + \lambda') P_{s_0}(s) - \mu' P_{s_1}(s) = 1 \tag{3.64a}$$

$$-\lambda' P_{s_0}(s) + (s + \mu' + \lambda) P_{s_1}(s) = 0 \tag{3.64b}$$

$$-\lambda P_{s_1}(s) + s P_{s_2}(s) = 0 \tag{3.64c}$$


Clearly, Eqs. (3.64a-c) lead to a matrix formulation if desired. However, 
we can simply solve these equations using Cramer’s rule since they are now 
algebraic equations. 


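$$P_{s_0}(s) = \frac{s + \lambda + \mu'}{s^2 + (\lambda + \lambda' + \mu')s + \lambda\lambda'} \tag{3.65a}$$

$$P_{s_1}(s) = \frac{\lambda'}{s^2 + (\lambda + \lambda' + \mu')s + \lambda\lambda'} \tag{3.65b}$$

$$P_{s_2}(s) = \frac{\lambda\lambda'}{s[s^2 + (\lambda + \lambda' + \mu')s + \lambda\lambda']} \tag{3.65c}$$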


We must now invert these equations (transform them from the frequency
domain to the time domain) to find the desired time solutions. There are
several alternatives at this point. One can apply transform No. 10 from Table B6
of Appendix B to Eqs. (3.65a, b) to obtain the solution as a sum of two
exponentials, or one can use a partial fraction expansion as illustrated in Eq. (B104)
of the appendix. An algebraic solution of these equations using partial fractions
appears in Shooman [1990, pp. 341-342], and further solution and plotting of
these equations is covered in the problems at the end of this chapter as well as
in Appendix B8. One can, however, make a simple comparison of the effects
of repair by computing the MTTF for the various models.
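
For readers who want numbers rather than algebra, the sum of Eqs. (3.65a, b) can be inverted by a two-term partial fraction expansion. The Python sketch below is one way to do this, not the worked solution cited above; the function name is hypothetical, and the formula assumes the two roots of the denominator are distinct, which holds whenever $\mu' > 0$ or $\lambda \neq \lambda'$.

```python
import math


def reliability_two_element_repair(lam, lam_p, mu_p, t):
    """R(t) = P_s0(t) + P_s1(t) for the model of Fig. 3.14.

    Adding Eqs. (3.65a) and (3.65b) gives R(s) = (s + b) / (s^2 + b*s + c),
    with b = lam + lam' + mu' and c = lam * lam'.
    """
    b = lam + lam_p + mu_p
    c = lam * lam_p
    disc = math.sqrt(b * b - 4.0 * c)  # assumed > 0 (distinct real roots)
    r1 = (-b + disc) / 2.0
    r2 = (-b - disc) / 2.0
    # Partial fractions: the residues are (r1 + b)/(r1 - r2) and (r2 + b)/(r2 - r1).
    return ((r1 + b) * math.exp(r1 * t) - (r2 + b) * math.exp(r2 * t)) / (r1 - r2)


if __name__ == "__main__":
    lam, mu = 1.0, 10.0  # illustrative values only
    for t in (0.0, 0.5, 1.0, 2.0):
        print(t,
              round(reliability_two_element_repair(lam, 2 * lam, 0.0, t), 4),   # parallel, no repair
              round(reliability_two_element_repair(lam, 2 * lam, mu, t), 4))    # parallel, with repair
```

With $\mu' = 0$ and $\lambda' = 2\lambda$ the expression collapses to $2e^{-\lambda t} - e^{-2\lambda t}$, the familiar two-element parallel result, which is a convenient sanity check.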

3.8.3 MTTF for Various Systems with Repair 

Rather than compute the complete reliability function of the several systems
we wish to compare, we can simplify the analysis by comparing the MTTF
for these systems. Furthermore, the MTTF is given by an integral of the
reliability function, and by using Laplace theory we can show [Section B8.2, Eqs.
(B105)-(B106)] that the MTTF is just given by the limit of the Laplace
transform expression as $s \to 0$.

For the model of Fig. 3.14, the reliability expression is the sum of the first
two state probabilities; thus, the MTTF is the limit of the sum of Eqs. (3.65a,
b) as $s \to 0$, which yields

$$\text{MTTF} = \frac{\lambda + \mu' + \lambda'}{\lambda\lambda'} \tag{3.66}$$

TABLE 3.4 Comparison of MTTF for Several Systems

Element                                Formula                        For λ = 1, μ = 10
Single element                         $1/\lambda$                    1.0
Two parallel elements, no repair       $1.5/\lambda$                  1.5
Two standby elements, no repair        $2/\lambda$                    2.0
Two parallel elements, with repair     $(3\lambda + \mu)/2\lambda^2$  6.5
Two standby elements, with repair      $(2\lambda + \mu)/\lambda^2$   12.0


We substitute the various values of $\lambda'$ shown in Fig. 3.14 in the expression;
since we are assuming a single repairman, $\mu' = \mu$. The MTTF for several
systems is compared in Table 3.4. Note how repair strongly increases the MTTF
of the last two systems in the table. For large $\mu/\lambda$ ratios, which are common
in practice, the MTTF of the last two systems approaches $0.5\mu/\lambda^2$ and $\mu/\lambda^2$.
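
The entries of Table 3.4 follow directly from Eq. (3.66) once $\lambda'$ and $\mu'$ are chosen. A minimal Python sketch of that substitution is given here; the function name and the row labels are illustrative choices, and the no-repair rows are obtained by the assumption $\mu' = 0$.

```python
def mttf_two_element(lam, lam_p, mu_p):
    """Eq. (3.66): MTTF = (lam + mu' + lam') / (lam * lam')."""
    return (lam + mu_p + lam_p) / (lam * lam_p)


if __name__ == "__main__":
    lam, mu = 1.0, 10.0  # the values used in Table 3.4
    rows = [
        ("Two parallel elements, no repair", mttf_two_element(lam, 2 * lam, 0.0)),
        ("Two standby elements, no repair", mttf_two_element(lam, lam, 0.0)),
        ("Two parallel elements, with repair", mttf_two_element(lam, 2 * lam, mu)),
        ("Two standby elements, with repair", mttf_two_element(lam, lam, mu)),
    ]
    for name, value in rows:
        print(f"{name:38s} {value:5.2f}")
```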


3.8.4 The Effect of Coverage on System Reliability 

In Fig. 3.12, we portrayed a fairly complex block diagram for a standby
system. We have already modeled the possibility of imperfection in the switching
mechanism. In this section, we develop a model for imperfections in the
decision unit that detects failures and switches from the on-line system to the
standby system. In some cases, even in an ordinary parallel system (hot
standby), it is not possible to have both systems fully connected, and a
decision unit and switch are needed. Another way of describing this phenomenon
is to say that the decision unit cannot detect 100% of all the on-line unit
failures; it only “covers” (detects) the fraction c (0 < c < 1) of all the possible
failures. (The formulation of this concept is generally attributed to Bouricius,
Carter, and Schneider [1969].) The problem is that if the decision unit does
not detect a failure of the on-line unit, input and output remain connected to
the failed on-line element. The result is a system failure, because although the
standby unit is good, there is no indication that it must be switched into use.
We can formulate a Markov model in Fig. 3.15, which allows us to evaluate
the effect of coverage. (Compare with the model of Fig. 3.14.) In fact, we can
use Fig. 3.15 to model the effects of coverage on either a hot or cold standby
system. Note that the symbol $D$ stands for the decision unit correctly detecting
a failure in the on-line unit, and the symbol $\bar{D}$ means that the decision unit
has not been able to (failed to) detect a failure in the on-line unit. Also, a new
arc has been added in the figure from the good state $s_0$ to the failed state $s_2$
for modeling the failure of the decision unit to “cover” a failure of the on-line
element.

[Figure 3.15: Markov graph with states $s_0$, $s_1$, and $s_2$, including a branch
labeled $\lambda''\Delta t$ for uncovered failures, where
$\lambda' = 2c\lambda$ and $\lambda'' = 2(1 - c)\lambda$ for an ordinary parallel system,
$\lambda' = c\lambda$ and $\lambda'' = (1 - c)\lambda$ for a standby system, and
$\mu' = \mu$ for one repairman.]

Figure 3.15 A Markov reliability model for two identical, parallel elements, k
repairmen, and coverage effects.

The Markov equations for Fig. 3.15, written in the Laplace domain, become the
following:


$$sP_{s_0}(s) - 1 = -(\lambda' + \lambda'') P_{s_0}(s) + \mu' P_{s_1}(s) \tag{3.67a}$$

$$sP_{s_1}(s) - 0 = \lambda' P_{s_0}(s) - (\lambda + \mu') P_{s_1}(s) \tag{3.67b}$$

$$sP_{s_2}(s) - 0 = \lambda'' P_{s_0}(s) + \lambda P_{s_1}(s) \tag{3.67c}$$


Compare the preceding equations with Eqs. (3.63a-c) and (3.64a-c). Writing 
these equations in a more symmetric form yields 


$$(s + \lambda' + \lambda'') P_{s_0}(s) - \mu' P_{s_1}(s) = 1 \tag{3.68a}$$

$$-\lambda' P_{s_0}(s) + (s + \mu' + \lambda) P_{s_1}(s) = 0 \tag{3.68b}$$

$$-\lambda'' P_{s_0}(s) - \lambda P_{s_1}(s) + s P_{s_2}(s) = 0 \tag{3.68c}$$

The solution of these equations yields 


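$$P_{s_0}(s) = \frac{s + \lambda + \mu'}{s^2 + (\lambda + \lambda' + \lambda'' + \mu')s + (\lambda\lambda' + \lambda''\mu' + \lambda\lambda'')} \tag{3.69a}$$

$$P_{s_1}(s) = \frac{\lambda'}{s^2 + (\lambda + \lambda' + \lambda'' + \mu')s + (\lambda\lambda' + \lambda''\mu' + \lambda\lambda'')} \tag{3.69b}$$

$$P_{s_2}(s) = \frac{\lambda'' s + \lambda\lambda' + \mu'\lambda'' + \lambda\lambda''}{s[s^2 + (\lambda + \lambda' + \lambda'' + \mu')s + (\lambda\lambda' + \lambda''\mu' + \lambda\lambda'')]} \tag{3.69c}$$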


For the model of Fig. 3.15, the reliability expression is the sum of the first
two state probabilities; thus the MTTF is the limit of the sum of Eqs. (3.69a,
b) as $s \to 0$, which yields

$$\text{MTTF} = \frac{\lambda + \mu' + \lambda'}{\lambda\lambda' + \lambda''\mu' + \lambda\lambda''} \tag{3.70}$$

TABLE 3.5 Comparison of MTTF for Several Systems

Numerical columns are for λ = 1, μ = 10.

Element                                 Formula                                                   c = 1   c = 0.95   c = 0.90
Single element                          $1/\lambda$                                               1.0     -          -
Two parallel elements, no repair:       $(0.5 + c)/\lambda$                                       1.5     1.45       1.40
  [$\mu' = 0$, $\lambda' = 2c\lambda$, $\lambda'' = 2(1 - c)\lambda$]
Two standby elements, no repair:        $(1 + c)/\lambda$                                         2.0     1.95       1.90
  [$\mu' = 0$, $\lambda' = c\lambda$, $\lambda'' = (1 - c)\lambda$]
Two parallel elements, with repair:     $[(1 + 2c)\lambda + \mu] / (2\lambda[\lambda + (1 - c)\mu])$   6.5     4.3        3.2
  [$\mu' = \mu$, $\lambda' = 2c\lambda$, $\lambda'' = 2(1 - c)\lambda$]
Two standby elements, with repair:      $[(1 + c)\lambda + \mu] / (\lambda[\lambda + (1 - c)\mu])$     12.0    7.97       5.95
  [$\mu' = \mu$, $\lambda' = c\lambda$, $\lambda'' = (1 - c)\lambda$]


When c = 1, $\lambda'' = 0$, and we see that Eq. (3.70) reduces to Eq. (3.66).
The effect of coverage on the MTTF is evaluated in Table 3.5 by making
appropriate substitutions for $\lambda'$, $\lambda''$, and $\mu'$. Notice what a strong effect the
coverage factor has on the MTTF of the systems with repair. For the two parallel
and two standby systems with repair, c = 0.90 more than halves the MTTF.
Practical values for c are hard to find in the literature and are dependent on
design. Siewiorek [1992, p. 288] comments, “a typical diagnostic program, for
example, may detect only 80-90% of possible faults.” Bell [1978, p. 91] states
that static testing of PDP-11 computers at the end of the manufacturing process
was able to find 95% of faults, such as solder shorts, open-circuit etch
connections, dead components, and incorrectly valued resistors. Toy [1987, p. 20]
states, “realistic coverages range between 95% and 99%.” Clearly, the value of c
should be a major concern in the design of repairable systems.
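
The sensitivity to coverage is easy to explore numerically. The following Python sketch substitutes the coverage-dependent rates of Fig. 3.15 into Eq. (3.70); the function name and the `standby` flag are illustrative assumptions introduced here for convenience.

```python
def mttf_with_coverage(lam, mu_p, c, standby):
    """Eq. (3.70) with lam' and lam'' set as in Fig. 3.15.

    standby=False: lam' = 2*c*lam, lam'' = 2*(1 - c)*lam (ordinary parallel).
    standby=True:  lam' = c*lam,   lam'' = (1 - c)*lam   (standby).
    """
    k = 1.0 if standby else 2.0
    lam_p = k * c * lam
    lam_pp = k * (1.0 - c) * lam
    return (lam + mu_p + lam_p) / (lam * lam_p + lam_pp * mu_p + lam * lam_pp)


if __name__ == "__main__":
    lam, mu = 1.0, 10.0  # the values used in Table 3.5
    for c in (1.0, 0.95, 0.90):
        print(c,
              round(mttf_with_coverage(lam, mu, c, standby=False), 2),
              round(mttf_with_coverage(lam, mu, c, standby=True), 2))
```

Passing `mu_p=0.0` gives the no-repair rows of the table, where coverage has only a mild effect.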

A more detailed treatment of coverage can be found in the literature. See 
Bouricius and Carter [1969, 1971]; Dugan [1989, 1996]; Kaufman and Johnson 
[1998]; and Pecht [1995]. 


3.8.5 Availability Models 

In some systems, it is tolerable to have a small amount of downtime as the
system is rapidly repaired and restored. In such a case, we allow repair out
of the system down state, and the model of Fig. 3.16 is obtained. Note that
Fig. 3.14 and Fig. 3.16 only differ in the repair branch from state $s_2$ to state $s_1$.

[Figure 3.16: Markov graph, where $\lambda' = 2\lambda$ for an ordinary system.]

Figure 3.16 Markov availability graph for two identical parallel elements.

Using the same techniques that we used above, one can show that the equations
for this model become


$$(s + \lambda') P_{s_0}(s) - \mu' P_{s_1}(s) = 1 \tag{3.71a}$$

$$-\lambda' P_{s_0}(s) + (s + \mu' + \lambda) P_{s_1}(s) - \mu'' P_{s_2}(s) = 0 \tag{3.71b}$$

$$-\lambda P_{s_1}(s) + (s + \mu'') P_{s_2}(s) = 0 \tag{3.71c}$$

See Shooman [1990, Section 6.10] for more information. 

The solution follows the same procedure as before. In this case, the sum of
the probabilities for states 0 and 1 is not the reliability function but the
availability function, $A(t)$. In most cases, $A(t)$ does not go to 0 as $t \to \infty$, as is
true with the $R(t)$ function. $A(t)$ starts at 1 and, for well-designed systems, decays
to a steady-state value close to 1. Thus a lower bound on the availability function
is the steady-state value. A simple means for solving for the steady-state value
is to formulate the differential equations for the Markov model and set all the
time derivatives to 0. The set of equations now becomes an algebraic set of
equations; however, the set is not independent. We obtain an independent set
of equations by replacing any one of these equations with the equation stating
that the sum of all the state probabilities equals 1. The algebraic solution for
the steady-state availability is often used in practice. An even simpler procedure
for computing the steady-state availability is to apply the final value theorem to
the transformed expression for $A(s)$. This method is used in Section 4.9.2.
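
The algebraic route to the steady-state availability can be sketched in a few lines of Python. The time-domain equations used below are the ones implied by Eqs. (3.71a-c) with the derivatives set to 0, and the third equation is replaced by the normalization condition. The function names, the small Gaussian-elimination helper, and the reading of $\mu''$ as the repair rate out of the down state are assumptions made for this illustration.

```python
def solve_linear(a, b):
    """Solve a*x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for k in range(col, n + 1):
                m[r][k] -= f * m[col][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][k] * x[k] for k in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def steady_state_availability(lam, lam_p, mu_p, mu_pp):
    """Steady-state A = P_s0 + P_s1 for the availability model of Fig. 3.16."""
    a = [
        [-lam_p, mu_p, 0.0],             # 0 = -lam' P0 + mu' P1
        [lam_p, -(lam + mu_p), mu_pp],   # 0 = lam' P0 - (lam + mu') P1 + mu'' P2
        [1.0, 1.0, 1.0],                 # P0 + P1 + P2 = 1 replaces the third equation
    ]
    p0, p1, _p2 = solve_linear(a, [0.0, 0.0, 1.0])
    return p0 + p1


if __name__ == "__main__":
    # Illustrative values: lam = 1, ordinary parallel (lam' = 2), mu' = mu'' = 10.
    print(steady_state_availability(1.0, 2.0, 10.0, 10.0))
```

For a model this small the balance equations can also be solved by hand ($P_{s_1} = (\lambda'/\mu')P_{s_0}$ and $P_{s_2} = (\lambda/\mu'')P_{s_1}$), which gives a quick check on the numerical result.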

This chapter and Chapter 4 are linked in many ways. The technique of voting
reliability joins parallel and standby system reliability as the three most
common techniques for fault tolerance. Also, the analytical techniques involving
Markov models are used in both chapters. In Chapter 4, a comparison is
made of the reliability and availability of parallel, standby, and voting systems;
in addition, some of the Markov modeling begun in this chapter is extended
in Chapter 4 for the purpose of this comparison. The following chapter also
has a more extensive discussion of the many shortcuts provided by Laplace
transforms.


3.9 RAID SYSTEMS RELIABILITY 
3.9.1 Introduction 

The reliability techniques discussed in Chapter 2 involved coding to detect
and correct errors in data streams. In this chapter, various parallel and standby
techniques have been introduced that significantly increase the reliability of
various systems and components. This section will discuss a newly developed
technology for constructing computer secondary-storage systems that utilize
the techniques of both Chapters 2 and 3 for the design of reliable, compact,
high-performance storage systems. The generic term for such memory system
technology is redundant disk arrays [Gibson, 1992]; however, it was soon
changed to redundant array of inexpensive disks (RAID), and as technology
evolved so that the quality and capacity of small disks rapidly increased, the
word “inexpensive” was replaced by “independent.” The term “array,” when
used in this context, means a collection of many disks organized in a specific
fashion to improve speed of data transfer and reliability. As the RAID
technology evolved, cache techniques (the use of small, very high-speed memories
to accelerate processing by temporarily retaining items expected to be needed
again soon) were added to the mix. Many varieties of RAID have been developed
and more will probably emerge in the future. The RAID systems that
employ cache techniques for speed improvement are sometimes called cached
array of inexpensive disks (CAID) [Buzen, 1993]. The technology is driven
by the variety of techniques available for connecting multiple disks, as well as
various coding techniques, alternative read-and-write techniques, and the
flexibility in organization to “tune” the architecture of the RAID system to match
various user needs.

Prior to 1990, the dominant technology for secondary storage was a group
of very large disks, typically 5-15, in a cabinet the size of a clothes washer.
Buzen [1993] uses the term single large expensive disk (SLED) to refer to
this technology. RAID technology utilizes a large number, typically 50-100,
of small disks the size of those used in a typical personal computer. Each disk
drive is assumed to have one actuator to position reads or writes, and large
and small drives are assumed to have the same I/O read- or write-time. The
bandwidth (BW) of such a disk is the reciprocal of the read-time. If data is
broken into “chunks” and read (written) in parallel chunks to each of the n small
disks in a RAID array, the effective BW increases. There is some “overhead”
in implementing such a parallel read-write scheme; however, in the limit:

$$\text{effective bandwidth} \to n \times \text{BW} \tag{3.72}$$


Thus, one possible beneficial effect of a RAID configuration in which many 
disks are written in parallel is a large increase in the BW. 

If the RAID configuration depends on all the disks working, then the
reliability of so many disks is lower than that of a smaller number of large disks.
If the failure rate of each of the n disks is denoted by $\lambda = 1/\text{MTTF}$, then the
failure rate and MTTF of n disks are given by

$$\text{effective failure rate} = n\lambda = \frac{1}{\text{effective MTTF}} = \frac{n}{\text{MTTF}} \tag{3.73}$$


The failure rate is n times as large and the MTTF is n times smaller. If data
is stored in “chunks” over many disks so that the write operation occurs in
parallel for increased BW, the reliability of the block of data decreases
significantly as per Eq. (3.73). Writing data in a distributed manner over a group
of disks is called striping or interleaving. The size of the “chunk” is a design
parameter in striping. To increase the reliability of a striped array, one can use
redundant disks and/or error-detecting and -correcting codes for “chunks” of
data of various sizes. We have purposely used the nonspecific term “chunk”
because one of the design choices, which will soon be discussed, is the size
of the “chunk” and how “the chunk” is distributed across various disks.
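
The trade-off expressed by Eqs. (3.72) and (3.73) can be put in numbers with a short Python sketch. The function names and the sample figures (50 disks, a per-disk MTTF of 300,000 hours, a per-disk bandwidth of 1 unit) are hypothetical values chosen for illustration, and the bandwidth figure is the idealized limit that ignores striping overhead.

```python
import math


def striped_array(n_disks, disk_bw, disk_mttf):
    """Limiting bandwidth, Eq. (3.72), and MTTF, Eq. (3.73), of a striped array
    with no redundancy (every disk must work)."""
    return n_disks * disk_bw, disk_mttf / n_disks


def striped_reliability(n_disks, disk_mttf, t):
    """Reliability of the same array under constant failure rates: exp(-n * t / MTTF)."""
    return math.exp(-n_disks * t / disk_mttf)


if __name__ == "__main__":
    bw, mttf = striped_array(50, 1.0, 300_000.0)  # hypothetical array
    print(bw, mttf)                               # 50x the bandwidth, 1/50 the MTTF
    print(striped_reliability(50, 300_000.0, 8766.0))  # after one year of operation
```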

The various trade-offs among objectives and architectural approaches have 
changed over the decade (1990-2000) in which RAID storage systems were 
developed. At the beginning, small disks had modest capacity, longer access 
and transfer times, higher cost, and lower reliability. The improvements in all 
these parameters have had major effects on design. 

The designers of RAID systems utilize various techniques of redundancy
and error-correcting codes to raise the reliability of the RAID system [Buzen,
1993]. The early literature defined six levels of RAID [Patterson, 1987, 1988;
Gibson, 1992], and most manufacturers followed these levels as guidelines in
describing their products. However, as variations and options developed,
classification became difficult, and some marketing managers took the classification
system to mean that a higher level of RAID meant a better system. Thus, one
vendor whose system included features of RAID 2 and RAID 5 decided to call
his product RAID 10, claiming the levels multiplied! [Friedman, 1996.]
Situations such as these led to the creation of the RAID Advisory Board, which
serves as an industry standards body to define seven (and possibly more)
levels of RAID [RAID, 1995; Massaglia, 1997]. The basic levels of RAID are
given in Table 3.6, and the reader is cautioned to remember that because the
various levels of RAID are meant to differentiate architectural approaches, an
increase in level does not necessarily correspond to an increase in BW or
reliability. Complexity, however, does probably increase as the RAID level increases.

TABLE 3.6 Summary Comparison of RAID Technologies

Level 0
  Common name: No RAID or JBOD (“just a bunch of disks”).
  Features: No redundancy; thus, many claim that to consider this RAID is a
  misnomer. A Level 0 system could have a striped array and even a cache for
  speed improvement. There is, however, decreased reliability compared to a
  single disk if striping is employed, and the BW is increased.

Level 1
  Common name: Mirrored disks (duplexing, shadowing).
  Features: Two physical disk drives store identical copies of the data, one on
  each drive. This concept may be generalized to n drives with n identical
  copies or to k sets of pairs with identical data. It is a simple scheme with
  high reliability and speed improvement, but there is high cost.

Level 2
  Common name: Hamming error-correcting code with bit-level interleaving.
  Features: Hamming SECSED (SECDED) code is computed on the data blocks and is
  striped across the disks. It is not often used in practice.

Level 3
  Common name: Parity-bit code at the bit level.
  Features: A parity-bit code is applied at the bit level and the parity bits
  are stored on a separate parity disk. Since parity bits are calculated for
  each strip, and strips appear on different disks, error detection is possible
  with a simple parity code. The parity disk must be accessed on all reads;
  generally, the disk spindles are synchronized.

Level 4
  Common name: Parity-bit code at the block level.
  Features: A parity-bit code is applied at the block level, and the parity bits
  are stored on a separate parity disk.

Level 5
  Common name: Parity-bit code at the sector level.
  Features: A parity-bit code is applied at the sector level and the parity
  information is distributed across the disks.

Level 6
  Common name: Parity-bit code at the bit level applied in two ways to provide
  correction when two disks fail.
  Features: Parity is computed in two different independent manners so that the
  array can recover from two disk failures.

Source: [The RAIDbook, 1995].

3.9.2 RAID Level 0

This level was introduced as a means of classifying techniques that utilize 
a disk array and striping to improve the BW; however, no redundancy is 
included, and the reliability decreases. Equations (3.72) and (3.73) describe 
these basic effects. The BW of the array has increased over individual disks, 
but the reliability has decreased. Since high reliability is generally required 
in the disk storage system, this level would rarely be used except for special 
applications. 

3.9.3 RAID Level 1 

The use of mirrored disks is an obvious way to improve reliability; if the two
disks are written in parallel, the BW is increased. If the data is striped across
the two disks, the parallel reading of a transaction can increase the BW by
a factor of 2. However, the second (backup) copy of the transaction must be
written, and if there is a continuous transaction stream, the duplicate data copy
requirement reduces the BW by a factor of 2, resulting in no change in the BW.
However, if transactions occur in bursts with delays between bursts, the
primary copy can be written at twice the BW during the burst, and the backup
copy can be performed during the pauses between bursts. Thus the doubling of
BW can be realized under those circumstances. Since memory systems
represent 40%-60% of the cost of computer systems [Gibson, 1992, pp. 50-51], the
use of mirrored disks greatly increases the cost of a computer system. Also,
since the reliability is that of a parallel system, the reliability function is given
by Eq. (3.8) and the MTTF by Eq. (3.26) for constant disk failure rates. If both
disks are identical and have the same failure rates, the MTTF of the mirrored
disks becomes 1.5 times greater than that of a single disk system. The Tandem
line of “Nonstop” computers (discussed in Section 3.10.1) are essentially
mirrored disks with the addition of duplicate computers, disk controllers, and
I/O buses. The RAIDbook [1995, p. 45] calls the Tandem configuration a fully
redundant system.

RAID systems of Level 2 and higher all have at least one hot spare disk. 
When a disk error is detected via an error-detecting code or other form of built- 
in disk monitoring, the disk system takes the remaining stored and redundant 
information and reconstructs a valid copy on the hot disk, which is switched-in, 
instead of the failed disk. Sometime later during maintenance, the failed disk 
is repaired or replaced. The differences among the following RAID levels are 
determined by the means of error detection, the size of the chunk that has 
associated error checking, and the pattern of striping. 


3.9.4 RAID Level 2 

This level of RAID introduces Hamming error-correcting codes similar to those
discussed in Chapter 2 to detect and correct data errors. The error-correcting
codes are added to the “chunks” of data and striped across the disks. In general,
this level of RAID employs a SECSED code or a SECDED code such as one
described in Chapter 2. The code is applied to data blocks, and the disk spindles
must be synchronized. One can roughly compare the reliability of this scheme
with a Level 1 system. For the Level 1 RAID system to fail, both disks must
fail, and the probability of failure is

$$P_{f1} = q^2 \tag{3.74}$$

For a Level 2 system to fail, one of the two disks must fail, which has a
probability of 2q, and the Hamming code must fail to detect an error. The example
used in The RAIDbook [1995] to discuss a Level 2 system is for ten data
disks and four check disks, representing a 40% cost overhead for redundancy
compared with a 100% overhead for a Level 1 system. In Chapter 2, we
computed the probability of undetected error for eight data bits and four check bits
in Eq. (2.27) and shall use these results to estimate the probability of failure
of a typical Level 2 system. For this example,

$$P_{f2} = (2q) \times [220 q^3 (1 - q)^9] \tag{3.75}$$

Clearly, for very small q, the Level 2 system has a smaller probability of failure.
The two equations, (3.74) and (3.75), are approximately equal for q = 0.064,
at which level the probability of failure is $0.064^2 \approx 0.0041$.

To appreciate how this level would apply to a typical disk, let us assume
that the MTTF for a typical disk is 300,000 hours. Assuming a constant failure-
rate model, $\lambda = 1/300{,}000 = 3.3 \times 10^{-6}$ per hour. The associated probability of
failure for a single disk would be $1 - \exp(-3.3 \times 10^{-6}\, t)$, and setting this expression
to 0.064 shows that a single disk reaches this probability of failure at about
20,000 hours. Since a year is roughly 10,000 hours (8,766), a mirrored disk
system would be superior for a few years of operation. A detailed reliability
comparison would require a prior design of a Level 2 system with the
appropriate number of disks, choice of chunk level (bit, byte, block, etc.), inclusion
of a swapping disk, disk striping, and other details.
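
The crossover quoted above can be reproduced with a few lines of Python. The sketch below evaluates Eqs. (3.74) and (3.75), locates the value of q at which they are equal by bisection, and converts that q to operating hours under the constant-failure-rate assumption. The function names and the bisection bracket (0.01 to 0.2) are assumptions made for this illustration.

```python
import math


def p_fail_level1(q):
    """Eq. (3.74): both mirrored disks fail."""
    return q * q


def p_fail_level2(q):
    """Eq. (3.75): one of two disks fails and the code misses the error."""
    return (2.0 * q) * (220.0 * q ** 3 * (1.0 - q) ** 9)


def crossover_q(lo=0.01, hi=0.2, tol=1e-10):
    """Bisection for the q at which Eqs. (3.74) and (3.75) are equal."""
    def f(q):
        return p_fail_level2(q) - p_fail_level1(q)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
    return 0.5 * (lo + hi)


def hours_to_reach(q, disk_mttf):
    """Time at which a single disk's failure probability 1 - exp(-t/MTTF) equals q."""
    return -disk_mttf * math.log(1.0 - q)


if __name__ == "__main__":
    q = crossover_q()
    print(round(q, 4), round(hours_to_reach(q, 300_000.0)))
```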

Detailed design of such a Level 2 disk system leads to nonstandard
disks, significantly raising the cost of the system, and the technique is seldom
used in practice.

3.9.5 RAID Levels 3, 4, and 5 

In Chapter 2, we discussed the fact that a single parity bit is an inexpensive 
and fairly effective way of significantly increasing reliability. Levels 3, 4, and 
5 apply such a parity-bit code to different size data “chunks” in various ways to 
increase the reliability of a disk array at a lower cost than a mirrored disk. We 
will model the reliability of a Level 3 system as an example. A disk can fail in
several ways: two are a surface failure (where stored bits are corrupted) and an
actuator, head, or spindle failure (where the entire disk does not work, a total
failure). We assume that disk optimization software that periodically locks out
bad bits on a disk generally protects against local surface failures, and the main
category of failures requiring fault tolerance is total failures.

Figure 3.17 A common mapping for a RAID Level 3 array [adapted from Fig. 48,
The RAIDbook, 1995].

Normally, a single parity bit will provide an error-detecting but not an
error-correcting code; however, the availability of parity checks for more than one
group of strips provides error-correcting ability. Consider the typical example
shown in Fig. 3.17. The parity disk computes a parity copy for strips (0-3)
and (4-7) using the EXOR function:

P(0-3) = strip 0 ⊕ strip 1 ⊕ strip 2 ⊕ strip 3 (3.76)

P(4-7) = strip 4 ⊕ strip 5 ⊕ strip 6 ⊕ strip 7 (3.77)

Assume that there is a failure of disk 2, corrupting the data on strip 1 and
strip 5. To regenerate the data on strip 1, we compute the EXOR of P(0-3)
along with strip 0, strip 2, and strip 3, which are on unfailed disks 5, 1, 3, and 4.

REGEN(1) = P(0-3) ⊕ strip 0 ⊕ strip 2 ⊕ strip 3 (3.78a)

and substitution of Eq. (3.76) into Eq. (3.78a) yields

REGEN(1) = (strip 0 ⊕ strip 1 ⊕ strip 2 ⊕ strip 3)
⊕ (strip 0 ⊕ strip 2 ⊕ strip 3) (3.78b)

Since strip 0 ⊕ strip 0 = 0, and similarly for strip 2 and strip 3, Eq. (3.78b)
results in the regeneration of strip 1.

REGEN(1) = strip 1 (3.78c)

The conclusion is that we can regenerate the information on strip 1, which
was on the catastrophically failed disk 2, from the other unfailed disks. Clearly,
one could recover the other data for strip 5, which is also on failed disk 2, in a
similar manner. These recovery procedures generalize to other Level 3, 4, and
5 recovery procedures. Allocating data to strips is called striping.

Figure 3.18 A Markov model for N + 1 disks protected by a single parity disk.
States: s0 = N + 1 good disks, s1 = N good disks, s2 = N - 1 or fewer good
disks. Self-loop probabilities: $1 - \lambda' \Delta t$ (s0),
$1 - (\lambda + \mu') \Delta t$ (s1), and 1 (s2).
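
The regeneration in Eqs. (3.76) and (3.78a-c) can be illustrated with a few lines of code. This is only a sketch under simple assumptions: the strip contents are arbitrary sample bytes, and `xor_strips` is a hypothetical helper, not part of any RAID product.

```python
from functools import reduce

def xor_strips(*strips):
    """Byte-wise EXOR of equal-length strips."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*strips))

# Four data strips of one parity group (contents are arbitrary sample data).
strip0, strip1, strip2, strip3 = b"DATA", b"ON-A", b"DISK", b"SET."

parity_0_3 = xor_strips(strip0, strip1, strip2, strip3)   # Eq. (3.76)

# The disk holding strip 1 fails; rebuild it from the parity and survivors.
regen_1 = xor_strips(parity_0_3, strip0, strip2, strip3)  # Eq. (3.78a)
print(regen_1)
```

Because EXOR is its own inverse, the same helper serves both to compute the parity and to regenerate a lost strip.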

A Level 3 system has N data disks that store the system data and one parity 
disk that stores the error-detection data for a total of N + 1 disks. The system 
succeeds if there are zero failures or one disk failure, since the damaged strips 
can be regenerated (repaired) using the above procedures. A Markov model 
for such operation is shown in Fig. 3.18. The solution follows the same path
as that of Fig. 3.14, and the same solution can be used if we replace $\lambda'$ by
$(N + 1)\lambda$, $\lambda$ by $N\lambda$, and $\mu'$ by $\mu$. Substitution of
these values into Eqs. (3.65a, b) and
adding these probabilities yields the reliability function. Substitution into Eq.
(3.66) yields the MTTF:

$\mathrm{MTTF} = [N\lambda + \mu + (N + 1)\lambda] / [N\lambda (N + 1)\lambda]$ (3.79a)

$\mathrm{MTTF} = [(2N + 1)\lambda + \mu] / [N(N + 1)\lambda^2]$ (3.79b)

These equations check with the model given in Gibson [1992, pp. 137-139].
In most cases, $\mu \gg \lambda$, and the MTTF expression given in Eq. (3.79b) becomes
$\mathrm{MTTF} \approx \mu/[N(N + 1)\lambda^2]$. If the recovery time were 1 hour
($\mu = 1$ per hour), N = 4 as in the design of Fig. 3.17, and
$\lambda = 1/300{,}000$ per hour as previously assumed, then
$\mathrm{MTTF} = 4.5 \times 10^9$ hours. Clearly, the recovery built into this
example makes the loss of data
very improbable. A comprehensive analysis would include the investigation of
other possible modes of failure, common mode failures, and so forth. If one
wishes to compute the availability of a RAID Level 3 system, a model similar
to that given in Fig. 3.16 can be used.
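
As a quick numerical check of Eq. (3.79b) and its approximation for a repair rate much larger than the failure rate, one might write the following sketch. The function names are hypothetical; the parameter values are those assumed in the text.

```python
def raid3_mttf(n_data, lam, mu):
    """MTTF of N data disks plus one parity disk, Eq. (3.79b)."""
    return ((2 * n_data + 1) * lam + mu) / (n_data * (n_data + 1) * lam ** 2)

def raid3_mttf_approx(n_data, lam, mu):
    """Approximation valid when mu is much larger than lam."""
    return mu / (n_data * (n_data + 1) * lam ** 2)

lam = 1 / 300_000   # failures per hour
mu = 1.0            # repairs per hour (1-hour recovery)
exact = raid3_mttf(4, lam, mu)
approx = raid3_mttf_approx(4, lam, mu)
print(f"exact {exact:.4e} h, approx {approx:.4e} h")
```

For these values the two results differ only in the fifth significant figure, which is why the simpler expression is normally adequate.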
3.9.6 RAID Level 6

There are several choices for establishing two independent parity checks. One 
approach is a horizontal-vertical parity approach. A parity bit for a string is 
computed in two different ways. Several rows of strings are written, from 
which a set of horizontal parity bits are computed for each row and a set of 
vertical parity bits are computed for each column. Actually, this description is 
just one approach to Level 6; any technique that independently computes two 
parity bits is classified as Level 6 (e.g., applying parity to two sets of bits, using 
two different algorithms for computing parity, and Reed-Solomon codes). For
a more comprehensive analysis of RAID systems, see Gibson [1992]. A comparison
of the various RAID levels was given in Table 3.6, on page 121.
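
The horizontal-vertical parity idea can be illustrated at the bit level. The sketch below is an assumption-laden toy, not a description of how any Level 6 array lays out its data: it stores one parity bit per row and per column of a small bit matrix and uses the two failing checks to locate and flip a single bad bit. All names are hypothetical.

```python
def parities(rows):
    """Return (row parity bits, column parity bits) for a rectangular bit matrix."""
    row_par = [sum(r) % 2 for r in rows]
    col_par = [sum(c) % 2 for c in zip(*rows)]
    return row_par, col_par

def correct_single_error(rows, row_par, col_par):
    """Locate and flip a single corrupted bit using the stored parities."""
    new_row, new_col = parities(rows)
    bad_rows = [i for i, (a, b) in enumerate(zip(row_par, new_row)) if a != b]
    bad_cols = [j for j, (a, b) in enumerate(zip(col_par, new_col)) if a != b]
    if len(bad_rows) == 1 and len(bad_cols) == 1:
        rows[bad_rows[0]][bad_cols[0]] ^= 1
        return bad_rows[0], bad_cols[0]
    return None   # no error, or an error pattern this simple scheme cannot fix

data = [[1, 0, 1, 1],
        [0, 1, 1, 0],
        [1, 1, 0, 0]]
stored_rows, stored_cols = parities(data)

received = [row[:] for row in data]
received[1][2] ^= 1                      # inject one bit error
location = correct_single_error(received, stored_rows, stored_cols)
print(location, received == data)
```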


3.10 TYPICAL COMMERCIAL FAULT-TOLERANT SYSTEMS: 
TANDEM AND STRATUS 

3.10.1 Tandem Systems 

In the 1980s, Tandem captured a significant share of the business market with
its “NonStop” computer systems. The name was a great asset, since it captured
the aspirations of many customers in the on-line transaction processing market
who required high reliability, such as banks, airlines, and financial institutions.
Since 1997, Tandem Computers has been owned by the Compaq Computer Corporation,
and it still stresses fault-tolerant computer systems. A 1999 survey estimates
that 66% of credit card transactions, 95% of securities transactions, and
80% of automated teller machine transactions are processed by Tandem computers
(now called NonStop Himalaya computers).

“As late as 1985 it was estimated that a conventional, well-managed,
transaction-processing system failed about once every two weeks for about an
hour” [Siewiorek, 1992, p. 586]. Since there are 168 hours in a week, the MTTF
is 336 hours and the MTTR is 1 hour; substitution into the basic steady-state
equation for availability, Eq. (B95a), yields an availability of 0.997.
(Remember that $\lambda$ = 1/MTTF and $\mu$ = 1/MTTR for constant failure and
repair rates.) To appreciate how mediocre such an availability is for a
high-reliability system, let us consider the availability of an automobile.
Typically an auto may require one repair per year (sometimes referred to as
nonscheduled maintenance to eliminate inclusion of scheduled maintenance, such
as oil changes, tire replacements, and new spark plugs), which takes one day
(drop-off to pickup time). The repair rate becomes 1 per day; the failure rate,
1/365 per day. Substitution into Eq. (B95a) yields a steady-state availability
of 365/366, or about 0.9973, nearly identical to our computer computation.
Clearly, a highly reliable computer system should have a much better
availability than a car!

Tandem’s original goal was to build a system with an MTTF of 100 years! There
was clearly much to do to improve the availability in terms of increasing the
MTTF, decreasing the MTTR, and structuring a system configuration with greatly
increased reliability and availability. Suppose one chooses a goal of 1 hour
for repair. This may be realistic for repairs such as board-swapping,
but suppose the replacement part is not available? If we assume that 1 hour
represents 90% of the repairs but that 10% of the repairs require a replacement
part that is unavailable and must be obtained by overnight mail (24 hours), the
weighted repair time is then (0.9 × 1 + 0.1 × 24) = 3.3 hours. Clearly, the MTTR
will depend on the distribution of failure modes, the stock of spare parts on
hand, and the efficiency of ordering procedures for spare parts that must be
ordered from the manufacturer. If one were to achieve an MTTF of 100 years and
an MTTR of 3.3 hours,
the availability given by Eq. (B95) would be an impressive 0.999996.
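
The three availability figures above follow from the same steady-state relation, availability = MTTF/(MTTF + MTTR). The short sketch below reproduces them; it is an illustration only, the function name is hypothetical, and a year is taken as 8,766 hours.

```python
HOURS_PER_YEAR = 8_766

def availability(mttf, mttr):
    """Steady-state availability mu/(lambda + mu) = MTTF/(MTTF + MTTR)."""
    return mttf / (mttf + mttr)

# Conventional 1985 transaction system: fails every two weeks for an hour.
a_conventional = availability(2 * 168, 1)

# Automobile: one failure per year, one day to repair (units: days).
a_auto = availability(365, 1)

# Weighted repair time: 90% one-hour repairs, 10% overnight parts (24 h).
mttr_weighted = 0.9 * 1 + 0.1 * 24

# Tandem goal: MTTF of 100 years with the weighted MTTR.
a_goal = availability(100 * HOURS_PER_YEAR, mttr_weighted)

print(round(a_conventional, 3), round(a_auto, 5), round(a_goal, 6))
```

Note that MTTF and MTTR need only share the same unit (hours or days); the ratio is dimensionless.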

The design objectives of Tandem computers were the following [Anderson, 
1985]: 

• No single hardware failure should stop the system. 

• Hardware elements should be maintainable with the on-line system. 

• Database integrity should be ensured. 

• The system should be modularly extensible without incurring application 
software changes. 

The last objective, extensibility of the system without a software change, 
played a role in Tandem’s success. The software allowed the system to grow by 
adding new pairs of Tandem computers while the operation continued. Many 
of Tandem’s competitors required that the system be brought down for system 
expansion, that new software and hardware be installed, and that the expanded 
system be regenerated. 

The original Tandem system was a combination of hardware and software
fault tolerance. (The author thanks Dr. Alan R Wood of Compaq Corporation
for his help in clarifying the Tandem architecture and providing details for this
section [Wood, 2001].) Each major hardware subsystem (CPUs, disks, power
supplies, controllers, and so forth) was (and still is) implemented with parallel
units continuously operating (hot redundancy). A diagram depicting the Tandem
architecture is shown in Fig. 3.19. The architecture supports N processors,
in which N is an even number between 2 and 16.

The Tandem processor subsystem uses hardware fault detection and software
fault tolerance to recover from processor failures. The Tandem operating
system, called Guardian, creates and manages heartbeat signals, saying “I’m
alive,” which each processor sends to all the other processors every second. If
a processor has not received a heartbeat signal from another processor within
two seconds, each operating processor enters a system state called regroup. The
regroup algorithm determines the hardware element(s) that has failed (which
could be a processor or the communications between a group of processors, or
it could be multiple failures) and also determines which system resources are
still available, avoiding bisection of the system, called the split-brain
condition, in which communications are lost between two processor groups and each
group tries to continue on its own. At the end of the regroup, each processor
knows the available system resources.
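
The heartbeat timing described above can be sketched with a simulated clock. This is a deliberately simplified illustration and not the Guardian regroup algorithm: it models only the detection step (a peer silent for more than two seconds), and the class and function names are hypothetical.

```python
HEARTBEAT_PERIOD = 1.0   # seconds between "I'm alive" messages (from the text)
TIMEOUT = 2.0            # silence that triggers regroup (from the text)

class Processor:
    def __init__(self, pid, peers):
        self.pid = pid
        # Time at which each peer was last heard from.
        self.last_heard = {p: 0.0 for p in peers if p != pid}

    def receive_heartbeat(self, sender, now):
        self.last_heard[sender] = now

    def overdue_peers(self, now):
        """Peers silent for longer than TIMEOUT; a non-empty result would
        trigger the regroup state."""
        return sorted(p for p, t in self.last_heard.items() if now - t > TIMEOUT)

def simulate(n_processors, failed_pid, fail_time, end_time):
    """Step a simulated clock; processor `failed_pid` stops sending at `fail_time`.
    Returns (detection time, detecting processor, overdue peers) or None."""
    pids = list(range(n_processors))
    procs = {p: Processor(p, pids) for p in pids}
    now = 0.0
    while now <= end_time:
        for sender in pids:
            if sender == failed_pid and now >= fail_time:
                continue                      # failed processor sends nothing
            for receiver in pids:
                if receiver != sender and not (receiver == failed_pid and now >= fail_time):
                    procs[receiver].receive_heartbeat(sender, now)
        for p in pids:
            if p != failed_pid and procs[p].overdue_peers(now):
                return now, p, procs[p].overdue_peers(now)
        now += HEARTBEAT_PERIOD
    return None

print(simulate(n_processors=4, failed_pid=2, fail_time=3.0, end_time=10.0))
```

In this toy run the last heartbeat from the failed processor arrives at t = 2 s, so the first check at which more than two seconds of silence has elapsed is at t = 5 s.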
Figure 3.19 Basic architecture of a Tandem NonStop computer system. [Reprinted
with permission of Compaq Computer Corporation.]


The original Tandem systems used custom microprocessors and checking
logic to detect hardware faults. If a hardware fault was detected, the processor
would stop sending output (including the heartbeat signal), causing the remaining
processors to regroup. Software fault tolerance is implemented via process
pairs using the Tandem Guardian operating system. A process pair consists
of a primary and a backup process running in separate processors. If the primary
process fails because of a software defect or processor hardware failure,
the backup process assumes all the duties of the primary process. While the
primary process is running, it sends checkpoint messages to the backup process
for ensuring that the backup process has all the process state information
it needs to assume responsibility in case of a failure. When a processor failure
is detected, the backup processes for all the processes that were running
in that processor take over, using the process state from the last checkpoint
and reexecuting any operations that were pending at the time of the failure.
Since checkpointing requires very little processing, the “backup” processor is
actually the primary processor for many tasks. In other words, all Tandem
processors spend most of their time processing transactions; only a small fraction
of their time is spent doing backup processing to protect against a failure.
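
The process-pair idea (checkpoint, then take over and reexecute the pending operation) can be sketched as follows. This is a toy model under stated assumptions, not Tandem code: both "processes" live in one Python program, the state is a single account balance, and every class and function name is hypothetical.

```python
import copy

def apply_op(state, amount):
    state["balance"] += amount

class BackupProcess:
    def __init__(self):
        self.state = {"balance": 0}
        self.pending = None            # operation in flight at last checkpoint

    def checkpoint(self, state, pending):
        self.state = copy.deepcopy(state)
        self.pending = pending

    def take_over(self):
        """Resume from the last checkpoint and re-execute the pending operation."""
        if self.pending is not None:
            apply_op(self.state, self.pending)
            self.pending = None
        return self.state

class PrimaryProcess:
    def __init__(self, backup):
        self.state = {"balance": 0}
        self.backup = backup

    def run(self, operations, crash_before_completing=None):
        for i, amount in enumerate(operations):
            # Checkpoint first, so the backup knows what was about to happen.
            self.backup.checkpoint(self.state, pending=amount)
            if i == crash_before_completing:
                raise RuntimeError("primary processor failed")
            apply_op(self.state, amount)
        self.backup.checkpoint(self.state, pending=None)

backup = BackupProcess()
primary = PrimaryProcess(backup)
try:
    primary.run([100, -30, 50], crash_before_completing=1)
except RuntimeError:
    recovered = backup.take_over()
print(recovered)
```

The primary fails while the second operation is pending; the backup resumes from the checkpointed balance and reexecutes that operation, so no completed or in-flight work is lost.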

In the Tandem system, hardware fault tolerance consists of multiple processors
performing the same operations and determining the correct output by using
either comparative or self-checking logic. The redundant processors serve as
standbys for the primary processor and do not perform additional useful work. If
a single processor fails, a redundant processor continues to operate, which
prevents an outage. The process pairs in the Tandem system provide software fault
tolerance and, like hardware fault tolerance, provide the ability to recover from
single hardware failures. Unlike hardware fault tolerance, however, they protect
against transient software failures because the backup process reexecutes an
operation rather than simultaneously performing the same operation.

The K-series NonStop Himalaya computers released by Tandem in 1992 operate
under the same basic principles as the original machines. However, they use
commercial microprocessors instead of custom-designed microprocessors. Since
commercial microprocessors do not have the custom fault-detection capabilities
of custom-designed microprocessors, Tandem had to develop a new architecture
to ensure data integrity. Each NonStop Himalaya processor contains two
microprocessor chips. These microprocessors are lock-stepped; that is, they run
exactly the same instruction stream. The output from the two microprocessors
is compared; if it should ever differ, the processor output is frozen within a few
nanoseconds so that the corrupted data cannot propagate. The output comparison
provides the processor fault detection. The takeover is still managed by process
pairs using the Tandem operating system, which is now called the NonStop
Kernel.

The S-series NonStop Himalaya servers released in 1997 provided new
architectural features. The processor and I/O buses were replaced with a network
architecture called ServerNet (see Fig. 3.20). The network architecture
allows any device controller to serve as the backup for any other device
controller. ServerNet incorporates a number of data integrity and fault-isolation
features, such as a 32-bit cyclic redundancy check (CRC) [Siewiorek, 1992, pp.
120-123] on all data packets and automatic low-level link error detection. It
also provides the interconnect for NonStop Himalaya servers to move beyond
the 16-processor node limit using an architecture called ServerNet Clusters.
Another feature of NonStop Himalaya servers is that all hardware replacements
and reconfigurations can be done without interrupting system operations. The
database can be reconfigured and some software patches can be installed without
interrupting system operations as well.
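
To show what a 32-bit CRC on a data packet accomplishes, the sketch below appends a CRC to a payload and verifies it on receipt. It is an assumption that the CRC-32 in Python's `zlib` module is a reasonable stand-in; the text does not state which polynomial or packet format ServerNet uses, and the function names are hypothetical.

```python
import struct
import zlib

def make_packet(payload: bytes) -> bytes:
    """Append a 32-bit CRC (zlib's CRC-32, an assumption) to the payload."""
    return payload + struct.pack(">I", zlib.crc32(payload))

def check_packet(packet: bytes) -> bool:
    """Recompute the CRC over the payload and compare with the trailer."""
    payload, (crc,) = packet[:-4], struct.unpack(">I", packet[-4:])
    return zlib.crc32(payload) == crc

packet = make_packet(b"debit account 1234 by 50.00")
corrupted = bytearray(packet)
corrupted[3] ^= 0x04                 # flip one bit in transit
print(check_packet(packet), check_packet(bytes(corrupted)))
```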

The S-series line incorporates many additional fault-tolerant features. The
power and cooling systems are redundant and derated so that a single power
supply or fan has sufficient capability to power or cool an entire cabinet. The
speed of the remaining fans automatically increases to maintain cooling if any fan
should fail. Temperature and voltage levels at key points are continuously
monitored, and alarms are sounded whenever the levels exceed safe thresholds.
Battery backup is provided to continue operation through any short-duration power
outages (up to 30 seconds) and to preserve the contents of memory to provide a
fast restart from outages shorter than 2 hours. (If it is necessary to protect against
longer power outages, the common solution for high-availability systems is to
provide a power supply with backup storage batteries plus DC-AC converters
and diesel generators to recharge the batteries. The superior procedure is to have
autostart generators, which automatically start when a power outage is detected;
however, they must be tested, perhaps once a week, to see if they will start.)
All controllers are redundant and dual-ported to serve the primary and secondary
connection paths. Each hardware and software module is self-checking and halts
immediately instead of permitting an error to propagate, a concept known as
fail-fast design, which makes it possible to determine the source of errors and
correct them. NonStop systems incorporate state-of-the-art memory error-detection
and -correction codes to correct single-bit errors, detect double-bit errors, and
detect “nibble” errors (3 or 4 bits in a row). Tandem has modified the memory
vendor’s error-correcting code (ECC) to include address bits, which helps avoid
reading from or writing to the wrong block of memory. Active techniques are used to
check for latent faults. A background memory “sniffer” checks the entire memory
every few hours.

Figure 3.20 S-Series NonStop Himalaya architecture. (Supplied courtesy of Wood
[2001].)

System data is protected in many ways. The multiple data paths provided
for fault tolerance are alternately used to ensure correct operation. Data on
all the buses is parity-protected, and parity errors cause immediate interrupts
to trigger error recovery. Disk-driver software provides an end-to-end checksum
that is appended to a standard 512-byte disk sector. For structured data,
such as SQL files, an additional end-to-end checksum (called a block checksum)
encodes data values, the physical location of the data, and transaction
information. These checksums protect against corrupted data values, partial
writes, and misplaced or misaligned data. NonStop systems can use the NonStop
remote duplicate database facility (NonStop RDF) to help recover from
disasters such as earthquakes, hurricanes, fires, and floods. NonStop RDF sends
database updates to a remote site up to thousands of miles away. If a disaster
occurs, the remote system takes over within a few minutes without losing
any transactions. NonStop Himalaya servers are even “coffee fault-tolerant,”
meaning the air vents are on the sides to protect against coffee spills on top of
the processor cabinet (or, more likely, if the sprinkler system in the computer
room is triggered). One would hope that Tandem has also thought about protection
against failure modes caused by inadvertent operator errors. Tandem
plans to use the Alpha microprocessor sometime in the future.
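
The value of including the physical location in an end-to-end checksum can be seen in a small sketch. It is illustrative only and assumes a CRC-32 over the sector address followed by the data; the actual NonStop checksum algorithm and on-disk format are not given in the text, and the dictionary "disk" and function names are hypothetical.

```python
import struct
import zlib

SECTOR_SIZE = 512

def write_sector(disk, address, data):
    """Store data with a checksum that also covers the intended address."""
    assert len(data) == SECTOR_SIZE
    checksum = zlib.crc32(struct.pack(">Q", address) + data)
    disk[address] = (data, checksum)

def read_sector(disk, address):
    """Return the data, or raise if the value or its location is wrong."""
    data, checksum = disk[address]
    if zlib.crc32(struct.pack(">Q", address) + data) != checksum:
        raise IOError(f"end-to-end checksum mismatch at sector {address}")
    return data

disk = {}
write_sector(disk, 7, b"A" * SECTOR_SIZE)
write_sector(disk, 8, b"B" * SECTOR_SIZE)

disk[9] = disk[7]            # simulate a misplaced write: right data, wrong sector
try:
    read_sector(disk, 9)
    misplaced_detected = False
except IOError:
    misplaced_detected = True
print(read_sector(disk, 7)[:4], misplaced_detected)
```

A checksum over the data alone would accept the misplaced sector, since its contents are internally consistent; covering the address is what exposes the error.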

To analyze the Tandem fault-tolerant system, one would formulate a Markov
model and proceed as was done previously in this chapter (but for more detail,
consult Chapter 4). One must also anticipate the possibilities of errors of
commission and omission in generating and detecting the heartbeat signals. This
could be modeled by a coverage factor representing the fraction of processor
faults that the heartbeat signal would diagnose. (This basic approach is
explored in the problems at the end of this chapter.) In Chapter 4, the
availability formulas are derived for a parallel system to compare with the
availability of a voting system [see Eq. (4.48) and Table 4.9]. Typical computations
at the end of Section 4.9.2 for a parallel system apply to the Tandem system.
A complete analysis would require the use of a Markov modeling program and
multiple models that include more detail and fault-tolerant features.
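
To give a feel for how a coverage factor enters such a model, the sketch below evaluates the MTTF of a very simple three-state Markov model of two parallel processors with repair. The model is an assumption made for illustration, not the Tandem model itself: a fraction c of first failures is detected and repaired at rate $\mu$, and an uncovered first failure takes the system down immediately. The failure and repair rates are hypothetical values.

```python
def mttf_parallel_with_coverage(lam, mu, c):
    """MTTF of two parallel processors with repair and fault coverage c.

    States: s0 = both up, s1 = one up (fault detected), s2 = system down.
    Rates:  s0->s1 = 2*lam*c, s0->s2 = 2*lam*(1-c), s1->s0 = mu, s1->s2 = lam.
    Solves T0 = 1/(2 lam) + c*T1 and T1 = 1/(lam+mu) + mu/(lam+mu)*T0.
    """
    return ((lam + mu) + 2 * lam * c) / (2 * lam * (lam + mu * (1 - c)))

lam = 1 / 10_000    # hypothetical processor failure rate, per hour
mu = 1.0            # hypothetical repair rate, per hour
for c in (1.0, 0.99, 0.9):
    print(f"c = {c:4}: MTTF = {mttf_parallel_with_coverage(lam, mu, c):.3e} hours")
```

Even a small shortfall in coverage dominates the result when repair is fast, which is why the coverage of the detection mechanism deserves attention in a detailed analysis.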

The original Guardian operating system was responsible for creating, destroying,
and monitoring processes, reporting on the failure or restoration of processors,
and handling the conventional functions of operating systems in addition to
multiprogramming system functions and I/O handling. The early Guardian system
required the user to exactingly program the checkpointing, the record locking,
and other functions. Thus expert programmers were needed for these tasks,
which were often slow in addition to exacting. To avoid such problems, Tandem
developed two simpler software systems: the terminal control program (TCP)
called Pathway, which provided users with a control program having
screen-handling modules written in a higher level (COBOL-like) language to issue
checkpoints and dealt with process management and processor failure; and the
transaction-monitoring facility (TMF) program, which dealt with the consistency
and recoverability of the database and provided concurrency control. The new
Himalaya software greatly simplifies such programming, and it provides options
to increase throughput. It also supports Tuxedo, CORBA, and Java to allow users to
write to industry-standard interfaces and still get the benefits of fault tolerance.
For further details, see Anderson [1985], Baker [1995], Siewiorek [1992, p. 586],
Wood [1995], and the Tandem Web site: [http://himalaya.compaq.com]. Also,
see the discussion in Chapter 5, Section 5.10.

3.10.2 Stratus Systems 

The Stratus line of continuous processing systems is designed to provide
uninterrupted operation without loss of data or performance degradation, as well
as without special application programming. In 1999, Stratus was acquired by
Investcorp, but it continues its operation as Stratus Computers. Stratus’s
customers include major credit card companies, 4 of the 6 U.S. regional securities
exchanges, the largest stock exchange in Asia, 15 of the world’s 20 top
banks, 9-1-1 emergency services, and others. (The author thanks Larry Sherman
of Stratus Computers for providing additional information about Stratus.)
The Stratus system uses the basic architecture shown in Fig. 3.21. Comparison
with the Tandem system architecture shown in Fig. 3.19 shows that both
systems have duplicated CPUs, I/O and memory controllers, disk controllers,
communication controllers, and high-speed buses. In addition, power supplies
and other buses are duplicated.

Figure 3.21 Basic Stratus architecture. [Reprinted with permission of Stratus
Computer.]

The Stratus lockstepped microprocessor architecture appears similar to the
Tandem architecture described in the previous section, but fault tolerance
is achieved through different mechanisms. The Stratus architecture is hardware
fault-tolerant, with four microprocessors (all running the same instruction
stream) configured as redundant pairs of physical processors. Processor failure
is detected by a microprocessor miscompare, and the redundant processor (pair
of microprocessors) continues processing with no takeover time. The Tandem
architecture is software fault-tolerant; although failure of a processor is also
detected by a microprocessor miscompare, takeover is managed by software
requiring a few seconds’ delay.

To summarize the comparison, the Tandem system is more complex, higher
in cost, and aimed at the upper end of the market. The Stratus system, on the
other hand, is simpler, lower in cost, and competes in the middle and
lower end of the market.

Each major Stratus circuit board has self-checking hardware that continuously
monitors operation, and if the checks fail, the circuit board removes
itself from service. In addition, each CPU board has two or more CPUs that
process the same data, and the outputs are compared at each clock cycle. If
the comparison fails, the CPU board removes itself from service and its twin
board continues processing without stopping. Stratus calls the architecture with
two CPUs being checked “pair and spare,” and claims that its architecture is
superior in detecting transient errors, is lower in cost, and does not require
intensive programming. Tandem points out that software fault tolerance also protects
against software faults (90% of all software faults are transient); note, however,
that there is the small possibility of missed or imagined software errors.
The Stratus approach requires a dedicated backup processor, whereas the Tandem
system can use the backup processor in a two-processor configuration to
do “useful work” before a failure occurs.

For a further description of the pair-and-spare architecture, consider logical
processors A and B. As previously discussed in the case of Tandem, logical
processor A is composed of lockstepped microprocessors $A_1$ and $A_2$, and logical
processor B is composed of lockstepped microprocessors $B_1$ and $B_2$. Processors
$A_1$ and $A_2$ compare outputs and will lock out processor A if there is
disagreement. A similar comparison is made for processor B, and lockout of processor B
occurs if processors $B_1$ and $B_2$ disagree. The basic mode of system failure is
a failure of one microprocessor from logical processor A together with one
microprocessor from logical processor B. The outputs of logical processors A and
B are not further checked and are
ORed on the output bus. Thus, if a very rare failure mode occurs where both
processors $A_1$ and $A_2$ fail in the same manner and both have the same wrong
output, the comparator would be fooled, the faulty output of logical processor
A would be ORed with the correct output of logical processor B, and wrong
results would appear on the output bus. Because of symmetry, identical failures
of $B_1$ and $B_2$ would also pass the comparator and corrupt the output. Although
these two failure modes would be rare, they should be included and evaluated
in a detailed analysis.
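
The comparison and lockout logic just described, including the rare identical-failure case, can be mimicked in a few lines. This is a behavioral sketch with hypothetical names, treating each microprocessor output as a small integer; it is not a description of the Stratus hardware.

```python
def logical_processor(out_1, out_2):
    """Compare two lockstepped microprocessor outputs.
    Returns the output, or None if the pair disagrees and locks itself out."""
    return out_1 if out_1 == out_2 else None

def pair_and_spare(a1, a2, b1, b2):
    """Combine logical processors A and B on the output bus.
    A surviving pair drives the bus; outputs are not cross-checked further."""
    a = logical_processor(a1, a2)
    b = logical_processor(b1, b2)
    if a is None and b is None:
        return None                      # both pairs locked out: system failure
    if a is None:
        return b
    if b is None:
        return a
    return a | b                         # both active: outputs ORed on the bus

GOOD = 0b1010
print(pair_and_spare(GOOD, GOOD, GOOD, GOOD))      # normal operation
print(pair_and_spare(GOOD, 0b1000, GOOD, GOOD))    # A miscompares, B continues
print(pair_and_spare(0b1110, 0b1110, GOOD, GOOD))  # A1, A2 fail identically
```

In the third case the comparator inside logical processor A sees two matching (but wrong) outputs, so the bus carries a corrupted value even though logical processor B is healthy.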
Recovery of partially completed transactions is performed by software using
the Stratus virtual operating system (VOS) and the transaction protection
facility (TPF). The latest Stratus servers also support Microsoft Windows 2000
operating systems. The Stratus Continuum 400 systems are based on the
Hewlett-Packard (HP) PA-RISC microprocessor family and run a version of
the HP-UX operating system.

The system can be expanded vertically by adding more processor boards or
horizontally via the StrataLINK. The StrataLINK will connect modules within
a building or miles away if extenders are used. Networking allows distributed
processing at remote distances under control of the VOS: one module could
run a program, another could access a file, and a third could print the results. To
shorten repair time, a board taken out of service is self-tested to determine if it
has a transient or permanent fault. In the former case, the system automatically
returns the board to service. In the case of a permanent failure, however, the
customer assistance center can immediately ship replacement parts or aid in the
diagnosis of problems by means of a secured, built-in communications link.

Stratus claims that its systems have about five minutes of downtime per
year. One can relate this statistic to availability if we start with Eq. (4.53),
which was derived for a single element; however, in this case the element is
a system. Repair rates are related to the amount of downtime in an interval
and failure rates to the amount of uptime in an interval. For convenience, we
let the interval be one year and denote the average uptime by U and the average
downtime by D. The repair rate, in repairs per year, is the reciprocal of
the years per repair, which is the downtime per year; thus, $\mu = 1/D$. Similar
reasoning leads to a relationship for the failure rate, $\lambda = 1/U$. Substituting the
above expressions for $\lambda$ and $\mu$ into Eq. (B95a) yields (also see Section 1.3.4):

$A = \frac{\mu}{\lambda + \mu} = \frac{1/D}{1/U + 1/D} = \frac{U}{U + D}$ (3.80)

Since a year contains 8,766 hours, and 5 minutes of downtime is 5/60 of an
hour, we can substitute in Eq. (3.80) and obtain

$A = \frac{8{,}766 - 5/60}{8{,}766} = 0.9999905$ (3.81)

Stratus calls this result a “five-nines availability.” The Bell Labs ESS No. 1A
goal of 2 hours of downtime in 40 years yields an availability of 0.9999943 and
is equivalent to 3 minutes of downtime per year (see Section 1.3.5), so the
quoted Stratus value is slightly lower. Of course, it is easier to
compare the unavailability, $\bar{A} = 1 - A$, of such highly reliable systems. Thus
ESS No. 1A had an unavailability goal of $57 \times 10^{-7}$, and Stratus claims that it
achieves an unavailability of $95 \times 10^{-7}$, which is larger by a factor of 5/3.
The availability formulation given in Eq. (3.80) is often used to estimate
availability based on measured up- and downtimes. For more details on the
derivation, see Shooman [1990, pp. 358-359].
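
The conversion between downtime per year and availability in Eqs. (3.80) and (3.81) is easy to mechanize. The sketch below is illustrative, with hypothetical function names, and takes a year as 8,766 hours as in the text.

```python
HOURS_PER_YEAR = 8_766
MINUTES_PER_YEAR = HOURS_PER_YEAR * 60

def availability_from_times(uptime, downtime):
    """Eq. (3.80): A = U / (U + D), with U and D in the same units."""
    return uptime / (uptime + downtime)

def downtime_minutes_per_year(avail):
    """Invert the relation: expected minutes of downtime in one year."""
    return (1 - avail) * MINUTES_PER_YEAR

# Stratus claim: about 5 minutes of downtime per year.
d_stratus = 5 / 60
a_stratus = availability_from_times(HOURS_PER_YEAR - d_stratus, d_stratus)

# ESS No. 1A goal: 2 hours of downtime in 40 years.
a_ess = availability_from_times(40 * HOURS_PER_YEAR - 2, 2)

print(f"Stratus: A = {a_stratus:.7f}, unavailability = {1 - a_stratus:.2e}")
print(f"ESS:     A = {a_ess:.7f}, unavailability = {1 - a_ess:.2e}")
print(f"ESS goal in minutes per year: {downtime_minutes_per_year(a_ess):.2f}")
```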

To analyze such a system, one would formulate a Markov model and proceed
as was done in this chapter and also in Chapter 4. One must also anticipate
the possibilities of errors of commission and omission in the hardware
comparisons of the various processors. This could be modeled by a coverage
factor representing the fraction of processor faults that go undetected by the
comparison logic. This basic approach is explored in the problems at the end
of this chapter.

Considerable effort must be expended during the design of a high-availability
computer system to decrease the mean time to repair and increase
the mean time between failures. Stratus provides a number of diagnostic LEDs
(light-emitting diodes) to aid in diagnosis and repair. The status of various
subsystems is indicated by green, red, and sometimes amber lights (there may also
be flashing red lights). Also, considerable emphasis is given to the power supply.
Manufacturers of high-reliability equipment know that the power supply
of a computer system is sometimes an overlooked feature but that it is of great
importance. During the late 1960s, the SABRE airlines reservation system was
one of the first large-scale multilocation transaction systems. During the early
stages of operation, many of the system outages were caused by power supply
problems [Shooman, 1983, p. 502]. As was previously stated, power supplies
for such large critical installations as air traffic control and nuclear plant
control are dual systems with a local power company as a primary supply backed
up by storage batteries with DC-AC converters and diesel generators as a third
line of defense. Small details must be attended to, such as running the diesel
generators for a few minutes a week to ensure that they will start in an
emergency. The Stratus power supply system contains three or four power supply
units as well as backup batteries and battery-temperature monitoring. The
batteries have sufficient load capacity to power the system for up to four minutes,
which is sufficient for one minute of operation during a power fluctuation plus
time for safe shutdown, or four consecutive outages of less than one minute
without time to recharge the batteries. Clearly, long power outages will bring
down the system unless there are backup batteries and generators. High battery
temperature and low battery voltage are monitored. To increase the MTTF of
the fan system (and to reduce acoustic noise), fans are normally run at
two-thirds speed, and in the case of overtemperature, failures, or other warning
conditions, they increase to full speed to enhance cooling.

For more details on Stratus systems, see Anderson [1985], Siewiorek [1992, 
p. 648], and the Stratus Web site: [http://www.stratus.com]. 

3.10.3 Clusters 

In general, the term cluster refers to a group of off-the-shelf computers
organized by software to serve a specific purpose requiring very large computing
power or high availability, fault tolerance, and on-line repairability. We are of
course interested in the latter application of clustering; however, we should first
cite two historic achievements of clusters designed for the former application
class [Hennessy, 1998, pp. 734-736]:

• In 1997, the IBM SP2 computer, a cluster of 32 IBM nodes similar to
the RS/6000 workstation with added hardware accelerators for chessboard
evaluation, beat the then-reigning world chess champion Garry Kasparov
in a human-machine contest.

• A cluster of 100 Sun UltraSPARC computers at the University of 
California-Berkeley, connected by 160 MB/sec Myrinet switches, set two 
world records: (a), 8.6 gigabytes of data stored on disk was sorted in 1 
minute; and (b), a 40-bit DES key encrypted message was cracked in 3.5 
hours. 

Fault-tolerant applications of clusters involve a different architecture. The
simplest scheme is to have two computers: one that is processing on-line and
the other that is operating in standby. If the operating system senses a
failure of the on-line computer, a recovery procedure is started to bring the
second computer on line. Unfortunately, such an architecture results in downtime
during the recovery period, which may or may not be acceptable
depending on the application. For a university computing center, downtime is
acceptable as long as it is minimal, but even a small amount of downtime
would be unacceptable for electronic funds transfer. A superior procedure is to
have facilities in the operating system that allow transfer from the on-line to
the standby computer without the system going down and without the loss of
information. The Tandem system can be considered a cluster, and some of the
VAX clusters in the 1980s were very popular.

As an example, we will discuss the hardware and Solaris operating-system 
features used by a Sun cluster [www.sun.com, 2000]. Some of the incorporated 
fault-tolerant features are the following: 

• Error-correcting codes are used on all memories and caches. 

• RAID controllers. 

• Redundant power supplies and cooling fans, each with overcapacity. 

• The system can lock out bad components during operation or when the 
server is rebooted. 

• The Solaris 8 operating system has error-capture capabilities, and more 
such capabilities will be included in future releases. 

• The Solaris 8 operating system provides recovery with a reboot, though 
outages occur. 

• The Sun Cluster 2.2 software, which is an add-on to the Solaris system,
will handle up to four nodes, providing networking and fiber-channel
interconnections as well as some form of nonstop processing when failures
occur.

• The Sun Cluster 3.0 software, released in 2000, will improve on Sun
Cluster 2.2 by increasing the number of nodes and simplifying the software.

It seems that the Sun Cluster software is now beginning to develop
fault-tolerant features that have been available for many years in the Tandem
systems. For a comprehensive discussion of clusters, see Pfister [1995].


PROBLEMS

3.1. Assume that a system consists of five series elements. Each of the elements
has the same reliability p, and the system goal is $R_s = 0.9$. Find p.

3.2. Assume that a system consists of five series elements. Three of the
elements have the same reliability p, and two have known reliabilities of
0.95 and 0.97. The system goal is $R_s = 0.9$. Find p.

3.3. Assume that a system consists of five series elements. The initial
reliability of all the elements is 0.9, each costing $1,000. All components
must be improved so that they have a lower failure rate for the
system to meet its goal of $R_s = 0.9$. Suppose that for three of the elements,
each 50% reduction in failure probability adds $200 to the element cost;
for the other two components, each 50% reduction in failure probability
adds $300 to the element cost. Find the lowest cost system that meets
the system goal of $R_s = 0.9$.

3.4. Would it be cheaper to use component redundancy for some or all of the
elements in problem 3.3? Explain. Give the lowest cost system design.

3.5. Compute the reliability of the system given in problem 3.1, assuming 
that one is to use 

(a) System reliability for all elements. 

(b) Component reliability for all elements. 

(c) Component reliability for selected elements. 

3.6. Compute the reliability of the system given in problem 3.2, assuming 
that one is to use 

(a) System reliability for all elements. 

(b) Component reliability for all elements. 

(c) Component reliability for selected elements. 

3.7. Verify the curves for m = 3 for Fig. 3.4. 

3.8. Verify the curves for Fig. 3.5. 

3.9. Plot the system reliability versus K (0 < K < 2) for Eqs. (3.13) and 
(3.15). 

3.10. Verify that Eq. (3.16) leads to the solution Kp = 0.9772 for p = 0.9.

3.11. Find the solution for problem 3.10 corresponding to p = 0.95. 

3.12. Use the approximate exponential expansion method discussed in Section 
3.4.1 to compute an approximate reliability expression for the systems 
shown in Figs. 3.3(a) and 3.3(b). Use these expressions to compare the 
reliability of the two configurations. 

3.13. Repeat problem 3.12 for the systems of Fig. 3.6(a) and 3.6(b). Are you 
able to verify the result given in problem 3.10 using these equations? 
Explain. 

3.14. Compute the system hazard function as discussed in Section 3.4.2 for 
the systems of Fig. 3.3(a) and Fig. 3.3(b). Do these expressions allow 
you to compare the reliability of the two configurations? 

3.15. Repeat problem 3.14 for the systems of Fig. 3.6(a) and 3.6(b). Are you 
able to verify the result given in problem 3.10 using these equations? 
Explain. 

3.16. The mean time to failure, MTTF, is defined as the mean (expected value,
first moment) of the time to failure distribution [density function f(t)].
Thus, the basic definition is

$$\mathrm{MTTF} = \int_{t=0}^{\infty} t\, f(t)\, dt$$

Using integration by parts, show that this expression reduces to Eq.
(3.24).

3.17. Compute the MTTF for Fig. 3.2(a)-(c) and compare. 

3.18. Compute the MTTF for 

(a) Fig. 3.3(a) and (b). 

(b) Fig. 3.6(a) and (b). 

(c) Fig. 3.8. 

(d) Eq. (3.40). 

3.19. Sometimes a component may have more than one failure state. For
example, consider a diode that has 3 states: good, $x_1$; failed as an open
circuit, $x_o$; failed as a short circuit, $x_s$.

(a) Make an RBD model.

(b) Write the reliability equation for a single diode.

(c) Write the reliability equation for two diodes in series.

(d) Write the reliability equation for two diodes in parallel.

(e) If $P(x_1) = 0.9$, $P(x_o) = 0.07$, and $P(x_s) = 0.03$, calculate the
reliability for parts (b), (c), and (d).

3.20. Suppose that in problem 3.19 you had only made a two-state
model (diode either good or bad, $P(x_g) = 0.9$, $P(x_b) = 0.1$). Would the
reliabilities of the three systems have been the same? Explain.

3.21. A mechanical component, such as a valve, can have two modes of
failure: leaking and blocked. Can we treat this with a three-state model as
we did in problem 3.19? Explain.

3.22. It is generally difficult to set up a reliability model for a system with
common mode failures. Oftentimes, making a three-state model will
help. Suppose $x_1$ denotes element 1 that is good, $x_c$ denotes element
1 that has failed in a common mode, and $x_i$ denotes element 1 that
has failed in an independent mode. Set up reliability models and
equations for a single element, two series elements, and two parallel elements
based on the one success and two failure modes. Given the probabilities
$P(x_1) = 0.9$, $P(x_c) = 0.03$, and $P(x_i) = 0.07$, evaluate the reliabilities of
the three systems.

3.23. Suppose we made a two-state model for problem 3.22 in which the
element was either good or bad, $P(x_1) = 0.9$, $P(\bar{x}_1) = 0.10$. Would the
reliabilities of the single element, two in series, and two in parallel be the
same as computed in problem 3.22?

3.24. Show that the sum of Eqs. (3.65a-c) is unity in the time domain. Is this 
result correct? Explain why. 

3.25. Make a model of a standby system with one on-line element and two
standby elements, all with identical failure rates. Formulate the Markov
model, write the equations, and solve for the reliability.

3.26. Compute the MTTF for problem 3.25.

3.27. Extend the model of Fig. 3.11 to n states. If all the transition probabilities
are equal, show that the state probabilities follow the Poisson
distribution. (This is one way of deriving the Poisson distribution.) Hint: use
of Laplace transforms helps in the derivation.

3.28. Compute the MTTF for problem 3.27. 

3.29. Compute the reliability of a two-element standby system with unequal 
on-line failure rates for the two components. Modify Fig. 3.11. 

3.30. Compute the MTTF for problem 3.29. 

3.31. Compute the reliability of a two-element standby system with equal
on-line failure rates and a nonzero standby failure rate.

3.32. Compute the MTTF for problem 3.31. 

3.33. Verify Fig. 3.13. 

3.34. Plot a figure similar to Fig. 3.13, where Eq. (3.60) replaces Eq. (3.58).
Under what conditions are the parallel and standby systems now
approximately equal? Compare with Fig. 3.13 and comment.

3.35. Reformulate the Markov model of Fig. 3.14 for two nonidentical parallel 
elements with one repairman; then write the equations and solve for the 
reliability. 

3.36. Compute the MTTF for problem 3.35. 

3.37. Reformulate the Markov model of Fig. 3.14 for two identical parallel 
elements with one repairman and a nonzero standby failure rate. Write 
the equations and solve for the reliability. 

3.38. Compute the MTTF for problem 3.37. 

3.39. Compute the reliability of a two-element standby system with unequal 
on-line failure rates for the two components. Include coverage. Modify 
Fig. 3.11 and Fig. 3.15. 

3.40. Compute the MTTF for problem 3.39. 

3.41. Compute the reliability of a two-element standby system with equal
on-line failure rates and a nonzero standby failure rate. Include coverage.

3.42. Compute the MTTF for problem 3.41.

3.43. Plot a figure similar to Fig. 3.13 where we compare the effect of
coverage (rather than an imperfect switch) in reducing the reliability of
a standby system. For what value of coverage are the parallel and
standby systems approximately equal? Compare with Fig. 3.13 and
comment.

3.44. Reformulate the Markov model of Fig. 3.14 for two nonidentical parallel
elements with one repairman; then write the equations and solve for the
reliability. Include coverage.

3.45. Compute the MTTF for problem 3.44.

3.46. Reformulate the Markov model of Fig. 3.14 for two identical parallel
elements with one repairman and a nonzero standby failure rate. Write
the equations and solve for the reliability. Include coverage.

3.47. Compute the MTTF for problem 3.46.

(In the following problems, you may wish to use a program that solves
differential equations or Laplace transform equations algebraically or
numerically: Maple, Mathcad, and so forth. See Appendix D.)

3.48. Compute the availability of a single element with repair. Draw the
Markov model and show that the availability becomes

$$A(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}\, e^{-(\lambda + \mu)t}$$

Plot this availability function for $\mu = 10\lambda$, $\mu = 100\lambda$, and
$\mu = 1{,}000\lambda$.

3.49. If we apply the MTTF formula to the A(t) function, what quantity do
we get? Compute for problem 3.48 and explain.

3.50. Show how we can get the steady-state value of A(t) for problem 3.48,

$$A(t \to \infty) = \frac{\mu}{\lambda + \mu}$$

in the following two ways:

(a) Set the time derivatives equal to zero in the Markov equations and
combine with the equation that states that the sum of all the
probabilities is unity.

(b) Use the Laplace transform final value theorem.

3.51. Solve the model of Fig. 3.16 for one repairman, an ordinary parallel
system, and values of $\mu = 10\lambda$, $\mu = 100\lambda$, and $\mu = 1{,}000\lambda$.
Plot the results.

3.52. Find the steady-state value of $A(t \to \infty)$ for problem 3.51.

3.53. Solve the model of Fig. 3.16 for one repairman, a standby system, and
values of $\mu = 10\lambda$, $\mu = 100\lambda$, and $\mu = 1{,}000\lambda$. Plot the results.


3.54. Find the steady-state value of $A(t \to \infty)$ for problem 3.53.

3.55. Solve the model of Fig. 3.16 augmented to include coverage for one
repairman, an ordinary parallel system, and values of $\mu = 10\lambda$,
$\mu = 100\lambda$, $\mu = 1{,}000\lambda$, c = 0.95, and c = 0.90. Plot the results.

3.56. Find the steady-state value of $A(t \to \infty)$ for problem 3.55.

3.57. Solve the model of Fig. 3.16 augmented to include coverage for one
repairman, a standby system, and values of $\mu = 10\lambda$, $\mu = 100\lambda$,
$\mu = 1{,}000\lambda$, c = 0.95, and c = 0.90. Plot the results.

3.58. Find the steady-state value of $A(t \to \infty)$ for problem 3.57.

3.59. Show by induction that Eq. (3.11) is always greater than unity. 

3.60. Derive Eqs. (3.22) and (3.23). 

3.61. Derive Eqs. (3.27) and (3.28). 

3.62. Consider the effect of common mode failures on the computation of Eq. 
(3.45). How large would the probability of common mode failures have 
to be to negate the advantage of a 20:21 system? 

3.63. Formulate a Markov model for a Tandem computer system. Include
the possibilities of errors of commission and omission in generating the
heartbeat signal (a coverage factor representing the fraction of processor
faults that the heartbeat signal would diagnose). Discuss, but do not
solve.

3.64. Formulate a Markov model for a Stratus computer system. Include the
possibilities of errors of commission and omission in the hardware
comparison of the various processors. This could be modeled by a coverage
factor representing the fraction of processor faults that go undetected by
the comparison logic. Discuss, but do not solve.

3.65. Compare the models of problems 3.63 and 3.64. What factors will
determine which system has a higher availability?

3.66. Determine what fault-tolerant features are supported by the latest release 
of the Sun operating system. 

3.67. Model the reliability of the system described in problem 3.66. 

3.68. Model the availability of the system described in problem 3.66. 

3.69. Search the Web to see if the Digital Equipment Corporation’s popular
VAX computer clusters are still being produced by Digital now that they
are owned by Compaq. (Note: Tandem is also owned by Compaq.) If
so, compare with the Sun cluster system.


N-MODULAR REDUNDANCY


4.1 INTRODUCTION 

In the previous chapter, parallel and standby systems were discussed as means
of introducing redundancy and ways to improve system reliability. After the
concepts were introduced, we saw that one of the complicating design features
was that of the coupler in a parallel system and that of the decision unit
and switch in a standby system. These complications are present in the design
of analog systems as well as digital systems. However, a technique known as
voting redundancy eliminates some of these problems by taking advantage of
the digital nature of the output of digital elements. The concept is simple to
explain if we view the output of a digital circuit as a string of bits. Without
loss of generality, we can view the output as a parallel byte (8 bits long). (The
concept generalizes to serial or parallel outputs n bits long.) Assume that we
apply the same input to two identical digital elements and compare the
outputs. If each bit agrees, then either they are both working properly (likely) or
they have both failed in an identical manner (unlikely). Using the concepts of
coding theory, we can describe this as an error-detection, not an
error-correction, method. If we detect a difference between the two outputs, then
there is an error, although we cannot tell which element is in error. Suppose we add
a third element and compare all three. If all three outputs agree bitwise, then
either all three are working properly (most likely) or all three have failed in the
same manner (most unlikely). If two of the element outputs (say, one and three)
agree, then most likely element two has failed and we can rely on the output
of elements one and three. Thus with three elements, we are able to correct
one error. If two elements have failed, it is quite possible that they have failed
in the same manner, and the comparison will agree (vote along) with the
erroneous majority. The bitwise comparison of the outputs (which are 1s or 0s)
can be done easily with simple digital logic. The next section references some
early works that led to the development of this concept, now called N-modular
redundancy.
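
As an illustration of the bitwise comparison just described, the following
minimal Python sketch votes on three 8-bit outputs. The function names
majority_vote and disagreeing_units and the sample byte values are illustrative
assumptions of ours; they do not describe any particular voter design.

```python
def majority_vote(a, b, c):
    """Bitwise two-out-of-three majority of three equal-width words."""
    return (a & b) | (a & c) | (b & c)


def disagreeing_units(a, b, c):
    """Names of the units whose output differs from the voted word."""
    voted = majority_vote(a, b, c)
    return [name for name, word in (("A", a), ("B", b), ("C", c)) if word != voted]


if __name__ == "__main__":
    good = 0b10110010            # assumed correct output byte
    faulty = good ^ 0b00000100   # unit B with one flipped bit
    print(f"{majority_vote(good, faulty, good):08b}")  # prints 10110010
    print(disagreeing_units(good, faulty, good))       # prints ['B']
```

Because the vote is taken bit by bit, two units that fail in different bit
positions are still outvoted in this sketch; two units that fail identically
are not, which is the case the text warns about.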

This chapter and Chapter 3 are linked in many ways. For example, the
technique of voting redundancy joins the parallel and standby techniques of
the previous chapter as the three most common techniques for fault tolerance.
(Also, the analytical techniques involving binomial probabilities and Markov
models are used in both chapters.) Thus many of the analyses in this chapter
that are aimed at comparing the three techniques constitute a continuation of
the analyses that were begun in the previous chapter.

The reader not familiar with the binomial distribution discussed in Sections 
A5.3 and B2.4 or the concepts of Markov modeling in Sections A8 and B7 
should read the material in these appendix sections first. Also, the introductory 
material on digital logic in Appendix C is used in this chapter for discussing 
voter circuitry. 


4.2 THE HISTORY OF N-MODULAR REDUNDANCY 

The history of majority voting begins with the work of some of the most
illustrious mathematicians of the 20th century, as outlined by Pierce [1965, pp.
2-7]. There were underlying currents of thought (linked together by
theoreticians) that focused on the following:

1. How to use automata theory (logic gates and state machines) to model 
digital circuit and digital computer operation. 

2. A model of the human nervous system based on an interconnection of 
logic elements. 

3. A means of making reliable computing machines from unreliable
components.

The third topic was driven by the maintenance problems of the early
computers related to relay and vacuum tube failures. A study of the Univac
computer that was undertaken by Bell and Newell [1971, pp. 157-169] yields
insight into these problems. The first Univac system passed its acceptance tests
and was put into operation by the Bureau of the Census in March 1951. This
machine was designed to operate 24 hours per day, 7 days per week (168
hours), except for approximately 32 hours of regularly scheduled
preventative maintenance per week. Thus the availability would be 136/168 (81%) if
there were no failures. In the 7-month period from June to December 1951, the
computer experienced about 22 hours per week of nonscheduled engineering time
(repair time due to failures), which reduced availability to 114/168 (68%). Some
of the stated causes of troubles were uniservo failures, noise, long time
constants, and tube failures occurring at a rate of about 2 per week. It is
therefore clear that reliability was a compelling issue.

Moore and Shannon of Bell Labs in a classic article [1956] developed
methods for making reliable relay circuits by various series and parallel connections
of relay contacts. (The relay was the active element of its time in the switching
networks of the telephone company as well as many elevator control systems
and many early computers built at Bell Labs starting in 1937. See Randell
[1975, Chapter VI] and Shooman [1990, pp. 310-320] for more information.)
The classic paper on majority logic was written by John von Neumann
(published in the work of Moore and Shannon [1956]), who developed the basic
idea of majority voting into a sophisticated scheme with many NAND elements
in parallel. Each input to the NAND element is supplied by a bundle of N
identical inputs, and the 2N inputs are cross-coupled so that each NAND element
has one input from each bundle. One of von Neumann’s elements was called
a restoring organ, since erroneous data that entered at the input was
compared with the correct input data, producing the correct output and restoring
the data.


4.3 TRIPLE MODULAR REDUNDANCY 
4.3.1 Introduction 

The basic modular redundancy circuit is triple modular redundancy (often
called TMR). The system shown in Fig. 4.1 consists of three parallel digital
circuits (A, B, and C), all with the same input. The outputs of the three
circuits are compared by the voter, which sides with the majority and gives
the majority opinion as the system output. If all three circuits are operating
properly, all outputs agree; thus the system output is correct. However, if one
element has failed so that it has produced an incorrect output, the voter chooses
the output of the two good elements as the system output because they both
agree; thus the system output is correct. If two elements have failed, the voter
agrees with the majority (the two that have failed); thus the system output is
incorrect. The system output is also incorrect if all three circuits have failed.
All the foregoing conclusions assume that a circuit fault is such that it always
yields the complement of the correct input. A slightly different failure model
is often used that assumes the digital circuit to have a fault that makes it
stuck-at-one (s-a-1) or stuck-at-zero (s-a-0). Assuming that rapidly changing signals
are exciting the circuit, a failure occurs within fractions of a microsecond of
the fault occurrence regardless of the failure model assumed. Therefore, for
reliability purposes, the two models are essentially equivalent; however, the
error-rate computation differs from that discussed in Section 4.3.3. For further
discussion of fault models, see Siewiorek [1982, pp. 17; 105-107] and [1992,
pp. 22; 32; 35; 37; 357; 804].

Figure 4.1 Triple modular redundancy. [Diagram: the system inputs (0, 1) feed
circuits A, B, and C, whose outputs go to a voter that produces the system
output (0, 1).]


4.3.2 System Reliability 

To apply TMR, all circuits (A, B, and C) must have equivalent logic and
must have the same truth tables. In most cases, they are three replications of
the same design and are identical. Using this assumption, and assuming that
the voter does not fail, the system reliability is given by

$$R = P(A \cdot B + A \cdot C + B \cdot C) \tag{4.1}$$

If all the digital circuits are independent and identical with probability of
success p, then this equation can be rewritten as follows in terms of the binomial
theorem.

$$
\begin{aligned}
R &= B(3:3) + B(2:3) \\
  &= \binom{3}{3} p^3 (1 - p)^0 + \binom{3}{2} p^2 (1 - p)^1 \\
  &= 3p^2 - 2p^3 = p^2(3 - 2p)
\end{aligned}
\tag{4.2}
$$

This is, of course, the reliability expression for a two-out-of-three system. The 
assumption that the digital elements fail so that they produce the complement 
of the correct input may not be valid. (It is, however, a worst-case type of 
result and should yield a lower bound, i.e., a pessimistic answer.) 
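
Equation (4.2) can be checked numerically. The sketch below compares the closed
form against a direct enumeration of the eight good/failed states of circuits
A, B, and C. It assumes independent, identical circuits and a perfect voter, as
in the text; the function names are hypothetical.

```python
from itertools import product


def tmr_reliability(p):
    """Closed form of Eq. (4.2): two-out-of-three system with a perfect voter."""
    return 3 * p**2 - 2 * p**3


def tmr_reliability_enumerated(p):
    """Sum the probabilities of all states with at least two good circuits."""
    total = 0.0
    for state in product((True, False), repeat=3):  # (A, B, C) good or failed
        if sum(state) >= 2:
            prob = 1.0
            for good in state:
                prob *= p if good else (1.0 - p)
            total += prob
    return total


if __name__ == "__main__":
    for p in (0.5, 0.75, 0.9, 0.99):
        print(p, round(tmr_reliability(p), 6), round(tmr_reliability_enumerated(p), 6))
```

At p = 0.5 both functions return 0.5, so under these assumptions TMR improves on
a single circuit only when p exceeds one-half.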

4.3.3 System Error Rate 

The probability model derived in the previous section enabled us to compute
the system reliability, that is, the probability of no failures. In many
problems, this is the primary measure of interest; however, there are also a number
of applications in which another approach is important. In a digital
communications system, for example, we are interested not only in the probability
that the system makes no errors but also in the error rate. In other words, we
assume that errors from temporary equipment malfunction or noise are not
catastrophic if they occur only rarely, and we wish to compute the probability
of such occurrence. Similarly, in digital computer processing of
non-safety-critical data, we could occasionally tolerate an error without shutting down
the operation for repair. A third, less clear-cut example is that of an inertial
guidance computer for a rocket. At every computation cycle, the computer
generates a course change and directs the missile control system accordingly. An
error in one computation will direct the missile off course. If the error is large,
the time between computations moderately long, the missile control system
and dynamics quick to respond, and the flight near its end, the target may be
missed, which is a catastrophic failure. If these factors are reversed,
however, a small error will temporarily steer the missile off course, much as
a wind gust does. As long as the error has cleared in one or two
computation cycles, the missile will rapidly return to its proper course. A model for
computing transmission-error probabilities is discussed below.

To construct the type of failure model discussed previously, we assume that 
one good state and two failed states exist: 

$A_1$ = element A gives a one output regardless of input (stuck-at-one, or s-a-1)

$A_0$ = element A gives a zero output regardless of input (stuck-at-zero, or
s-a-0)

To work with this three-state model, we shall change our definition of reliability 
to “the probability that the digital circuit gives the correct output to any given 
input.” Thus, for the circuits of Fig. 4.1, if the correct output is to be a one, 
the probability expression is 


$$P_1 = 1 - P(A_0 B_0 + A_0 C_0 + B_0 C_0) \tag{4.3a}$$

Equation (4.3a) states that the probability of correctly indicating a one output is 
given by unity minus the probability of two or more “zero failures.” Similarly, 
the probability of correctly indicating zero output is given by Eq. (4.3b): 


$$P_0 = 1 - P(A_1 B_1 + A_1 C_1 + B_1 C_1) \tag{4.3b}$$

If we assume that a one output and a zero output have equal probability of
occurrence, 1/2, on any particular transmission, then the system reliability is
the average of Eqs. (4.3a) and (4.3b). If we let


$$P(A) = P(B) = P(C) = p \tag{4.4a}$$
$$P(A_1) = P(B_1) = P(C_1) = q_1 \tag{4.4b}$$
$$P(A_0) = P(B_0) = P(C_0) = q_0 \tag{4.4c}$$

and assume that all states and all elements fail independently, keeping in mind
that the expansion of the second term in Eq. (4.3a) has seven terms, then
substitution of Eqs. (4.4a-c) in Eq. (4.3a) yields the following equations:

$$P_1 = 1 - P(A_0 B_0) - P(A_0 C_0) - P(B_0 C_0) + 2P(A_0 B_0 C_0) \tag{4.5a}$$
$$P_1 = 1 - 3q_0^2 + 2q_0^3 \tag{4.5b}$$

Similarly, Eq. (4.3b) becomes

$$P_0 = 1 - P(A_1 B_1) - P(A_1 C_1) - P(B_1 C_1) + 2P(A_1 B_1 C_1) \tag{4.6a}$$
$$P_0 = 1 - 3q_1^2 + 2q_1^3 \tag{4.6b}$$
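
The step from Eq. (4.3a) to Eq. (4.5a) can be spelled out with the
inclusion-exclusion expansion for the union of three events. The following
worked sketch writes out the seven terms mentioned in the text, assuming only
the independence stated there; every pairwise or triple intersection of the
events reduces to the same event, all three elements stuck at zero.

```latex
\begin{align*}
P(A_0B_0 + A_0C_0 + B_0C_0)
  &= P(A_0B_0) + P(A_0C_0) + P(B_0C_0) \\
  &\quad - P(A_0B_0 \cdot A_0C_0) - P(A_0B_0 \cdot B_0C_0) - P(A_0C_0 \cdot B_0C_0) \\
  &\quad + P(A_0B_0 \cdot A_0C_0 \cdot B_0C_0) \\
  &= P(A_0B_0) + P(A_0C_0) + P(B_0C_0) - 3P(A_0B_0C_0) + P(A_0B_0C_0) \\
  &= 3q_0^2 - 2q_0^3
\end{align*}
```

Subtracting this result from unity gives Eq. (4.5b); the same steps with the
subscript 1 in place of 0 give Eq. (4.6b).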


Averaging Eq. (4.5a) and Eq. (4.6a) gives

$$P = \frac{P_0 + P_1}{2} \tag{4.7a}$$
$$P = 1 - \frac{1}{2}\left(3q_0^2 + 3q_1^2 - 2q_0^3 - 2q_1^3\right) \tag{4.7b}$$

To compare Eq. (4.7b) with Eq. (4.2), we choose the same probability for
both failure modes, $q_0 = q_1 = q$; therefore, $p + q_0 + q_1 = p + q + q = 1$, and
$q = (1 - p)/2$. Substitution in Eq. (4.7b) yields

$$P = \frac{1}{2} + \frac{3}{4}p - \frac{1}{4}p^3 \tag{4.8}$$


The two probabilities, Eq. (4.2) and Eq. (4.8), are compared in Fig. 4.2. 

To interpret the results, it is assumed that the digital circuit in Fig. 4.1 is 
turned on at t = 0 and that initially the probability of each digital circuit being 
successful is p = 1.00. Thus both the reliability and probability of successful 
transmission are unity. If after 1 year of continuous operation p drops to 0.750, 
the system reliability becomes 0.844; however, the probability that any one 
message is successfully transmitted is 0.957. To put the result another way, 
if 1,000 such digital circuits were operated for 1 year, on average 156 would 
not be operating properly at that time. However, the mistakes made by these 
machines would amount to 43 mistakes per 1,000 on the average. Thus, for 
the entire group, the error rate would be 4.3% after 1 year. 
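
The figures quoted above can be reproduced with a few lines of Python. This is
only a sketch under the assumptions of this section (independent elements,
equally likely one and zero outputs, and equal failure-mode probabilities
q0 = q1 = (1 - p)/2); the function names are illustrative.

```python
def tmr_reliability(p):
    """Eq. (4.2): probability that at least two of three circuits are good."""
    return 3 * p**2 - 2 * p**3


def transmission_success(q0, q1):
    """Eq. (4.7b): probability that a single transmitted bit is voted correctly."""
    return 1 - 0.5 * (3 * q0**2 + 3 * q1**2 - 2 * q0**3 - 2 * q1**3)


def transmission_success_equal_modes(p):
    """Eq. (4.8): Eq. (4.7b) with q0 = q1 = (1 - p) / 2."""
    return 0.5 + 0.75 * p - 0.25 * p**3


if __name__ == "__main__":
    p = 0.75
    q = (1 - p) / 2
    print(round(tmr_reliability(p), 3))                   # prints 0.844
    print(round(transmission_success(q, q), 3))           # prints 0.957
    print(round(transmission_success_equal_modes(p), 3))  # prints 0.957
    print(round(1000 * (1 - tmr_reliability(p))))         # prints 156
    print(round(1000 * (1 - transmission_success(q, q)))) # prints 43
```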


Figure 4.2 Comparison of probability of successful transmission with the reliability.


4.3.4 TMR Options

Systems with modular redundancy can be designed to behave in different
ways in practice [Toy, 1987; Arsenault, 1980, p. 137]. Let us examine in more
detail the way a TMR system works. As previously described, the TMR system
functions properly if there are no system failures or one system failure.
The reliability expression was previously derived in terms of the probability
of element success, p, as

$$R = 3p^2 - 2p^3 \tag{4.9}$$

If we assume a constant-failure rate $\lambda$, then each component has a reliability
$p = e^{-\lambda t}$, and substitution into Eq. (4.9) yields

$$R(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t} \tag{4.10}$$

We can compute the MTTF for this system by integrating the reliability
function, which yields

$$\mathrm{MTTF} = \frac{3}{2\lambda} - \frac{2}{3\lambda} = \frac{5}{6\lambda} \tag{4.11}$$


Toy calls this a TMR 3-2 system because the system succeeds if 3 or 2 units 
are good. Thus when a second failure occurs, the voter does not know which 
of the systems has failed and cannot determine which is the good system. 

In some cases, additional information is available by such means as
observation (from a human operator or an automated system) of the two remaining
units after the first failure occurs. For example, in the event of failure, if one
of the two remaining units has behaved strangely or erratically, the “strange”
system would be locked out (i.e., disconnected) and the other unit would be
assumed to operate properly. In such a case, the TMR system really becomes a
1 : 3 system with a voter, which Toy calls a TMR 3-2-1 system. Equation (4.9)
will change, and we must add the binomial probability of 1 : 3 to the equation,
that is, $B(1:3) = 3p(1 - p)^2$, yielding

$$R = 3p^2 - 2p^3 + 3p(1 - p)^2 = p^3 - 3p^2 + 3p \tag{4.12a}$$

Substitution of $p = e^{-\lambda t}$ gives

$$R(t) = e^{-3\lambda t} - 3e^{-2\lambda t} + 3e^{-\lambda t} \tag{4.12b}$$

and an MTTF calculation yields

$$\mathrm{MTTF} = \frac{1}{3\lambda} - \frac{3}{2\lambda} + \frac{3}{\lambda} = \frac{11}{6\lambda} \tag{4.13}$$
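
The MTTF values in Eqs. (4.11) and (4.13) follow from integrating each
exponential term of R(t) from zero to infinity, since a term
c·exp(-kλt) contributes c/(kλ). The sketch below carries out that sum with
exact fractions; the dictionary layout and the function name are our own
illustrative choices.

```python
from fractions import Fraction


def mttf_coefficient(terms):
    """Return MTTF * lambda for R(t) = sum of c * exp(-k * lambda * t).

    `terms` maps each integer k to its coefficient c. Each term integrates
    to c / (k * lambda) over t from 0 to infinity.
    """
    return sum(Fraction(c, k) for k, c in terms.items())


SINGLE = {1: 1}                    # R(t) = exp(-lambda t)
TMR_3_2 = {2: 3, 3: -2}            # Eq. (4.10)
TMR_3_2_1 = {3: 1, 2: -3, 1: 3}    # Eq. (4.12b)

if __name__ == "__main__":
    for name, terms in (("single", SINGLE), ("TMR 3-2", TMR_3_2),
                        ("TMR 3-2-1", TMR_3_2_1)):
        print(f"{name}: MTTF = ({mttf_coefficient(terms)}) / lambda")
```

The results are 1/λ for a single element, 5/(6λ) for TMR 3-2, and 11/(6λ) for
TMR 3-2-1, so on MTTF alone the TMR 3-2 system falls below a single element.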


If we compare these results with those given in Table 3.4, we see that on 
the basis of MTTF, the TMR 3-2 system is slightly worse than a system with 
two standby elements. However, if we make a series expansion of the two 
functions and compare them in the high-reliability region, the TMR 3-2 system 
is superior. In the case of the TMR 3-2-1 system, it has an MTTF that is 
nearly the same as two standby elements. Again, a series expansion of the two 
functions and comparison in the high-reliability region is instructive. 

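For a single element, the truncated expansion of the reliability function
$e^{-\lambda t}$ is

$$R_s = 1 - \lambda t \tag{4.14}$$

For a TMR 3-2 system, the truncated expansion of the reliability function, Eq.
(4.9), is

$$
\begin{aligned}
R_{TMR(3\text{-}2)} &= e^{-2\lambda t}\left(3 - 2e^{-\lambda t}\right) \\
  &\approx \left[1 - 2\lambda t + (2\lambda t)^2/2\right]
     \cdot \left[3 - 2\left(1 - \lambda t + (\lambda t)^2/2\right)\right] \\
  &\approx 1 - 3(\lambda t)^2
\end{aligned}
\tag{4.15}
$$

For a TMR 3-2-1 system, the truncated expansion of the reliability function,
Eq. (4.12b), is

$$
\begin{aligned}
R_{TMR(3\text{-}2\text{-}1)} &= e^{-3\lambda t} - 3e^{-2\lambda t} + 3e^{-\lambda t} \\
  &\approx \left[1 - 3\lambda t + (3\lambda t)^2/2 - (3\lambda t)^3/6\right] \\
  &\quad - 3\left[1 - 2\lambda t + (2\lambda t)^2/2 - (2\lambda t)^3/6\right] \\
  &\quad + 3\left[1 - \lambda t + (\lambda t)^2/2 - (\lambda t)^3/6\right] \\
  &= 1 - \lambda^3 t^3
\end{aligned}
\tag{4.16}
$$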


Figure 4.3 Comparison of the reliability functions of a single system, a TMR 3-2
system, and a TMR 3-2-1 system in the high-reliability region.

Equations (4.14), (4.15), and (4.16) are plotted in Fig. 4.3 showing the
superiority of the TMR systems in the high-reliability region. Note that the
TMR (3-2) system reliability decreases to about the same value as a single
element when $\lambda t$ increases from about 0.3 to 0.35. Thus, the TMR is of most
use for $\lambda t < 0.2$, whereas TMR (3-2-1) is of greater benefit and provides a
considerably higher reliability for $\lambda t < 0.5$.
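
The comparison can also be tabulated numerically. The sketch below evaluates
the truncated expressions, Eqs. (4.14) to (4.16), next to the untruncated
functions they approximate; the function names and the sample values of λt are
our own assumptions. The truncated TMR 3-2 and single-element curves meet at
λt = 1/3, in line with the range quoted above, whereas the untruncated Eq.
(4.10) meets exp(-λt) at p = 1/2, that is, at λt = ln 2 (about 0.69). The
truncated forms should therefore be read as small-λt approximations only.

```python
import math


def single(x):
    """Exact single-element reliability, with x = lambda * t."""
    return math.exp(-x)


def tmr_3_2(x):
    """Exact TMR 3-2 reliability, Eq. (4.10)."""
    return 3 * math.exp(-2 * x) - 2 * math.exp(-3 * x)


def tmr_3_2_1(x):
    """Exact TMR 3-2-1 reliability, Eq. (4.12b)."""
    return math.exp(-3 * x) - 3 * math.exp(-2 * x) + 3 * math.exp(-x)


def single_truncated(x):
    return 1 - x            # Eq. (4.14)


def tmr_3_2_truncated(x):
    return 1 - 3 * x**2     # Eq. (4.15)


def tmr_3_2_1_truncated(x):
    return 1 - x**3         # Eq. (4.16)


if __name__ == "__main__":
    print("lambda*t  single  TMR3-2  TMR3-2-1  (exact / truncated)")
    for x in (0.05, 0.1, 0.2, 1 / 3, 0.5, math.log(2)):
        print(f"{x:8.3f}  "
              f"{single(x):.4f}/{single_truncated(x):.4f}  "
              f"{tmr_3_2(x):.4f}/{tmr_3_2_truncated(x):.4f}  "
              f"{tmr_3_2_1(x):.4f}/{tmr_3_2_1_truncated(x):.4f}")
```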

For further comparisons of MTTF and reliability for N-modular systems,
refer to the problems at the end of the chapter.


4.4 N-MODULAR REDUNDANCY
4.4.1 Introduction 

The preceding section introduced TMR as a majority voting scheme for
improving the reliability of digital systems and components. Of course, this is
the most common implementation of majority logic because of the increased
cost of replicating systems. However, with the reduction in cost of digital
systems from integrated circuit advances, it is practical to discuss N-version voting
or, as it is now more popularly called, N-modular redundancy. In general, N is
an odd integer; however, if we have additional information on which systems
are malfunctioning and also the ability to lock out malfunctioning systems, it
is feasible to let N be an even integer. (Compare advanced voting techniques in
Section 4.11 and the Space Shuttle control system example in Section 5.9.3.)

The reader should note there is a pitfall to be skirted if we contemplate
the design of, say, a 5-level majority logic circuit on a chip. If the five digital
circuits plus the voter are all on the same chip, and if only input and output
signals are accessible, there would be no way to test the chip, for which reason
additional test outputs would be needed. This subject is discussed further in
Sections 4.6.2 and 4.7.4.

In addition, if we contemplate using N-modular redundancy for a digital
system composed of the three subsystems A, B, and C, the question arises:
Do we use N-modular redundancy on three systems ($A_1 B_1 C_1$, $A_2 B_2 C_2$, and
$A_3 B_3 C_3$) with one voter, or do we apply voting on a lower level, with one
voter comparing $A_1 A_2 A_3$, a second comparing $B_1 B_2 B_3$, and a third comparing
$C_1 C_2 C_3$? If we apply the principles of Section 3.3, we will expect that voting
on a component level is superior and that the reliability of the voter must be
considered. This section explores such models.

4.4.2 System Voting 

A general treatment of N-modular redundancy was developed in the 1960s
[Knox-Seith, 1963; Pierce, 1961]. If one considers a system of $2n + 1$ parallel
digital elements (note that this is an odd number) and a single perfect
voter, the reliability expression is given by

$$R = \sum_{i=n+1}^{2n+1} B(i : 2n+1) = \sum_{i=n+1}^{2n+1} \binom{2n+1}{i} p^i (1 - p)^{2n+1-i} \tag{4.17}$$

The preceding expression is plotted in Fig. 4.4 for the case of one, three,
five, and nine elements, assuming $p = e^{-\lambda t}$. Note that as $n \to \infty$,
the MTTF of the system $\to 0.69/\lambda$. The limiting behavior of Eq. (4.17) as
$n \to \infty$ is discussed in Shooman [1990, p. 302]; the reliability function
approaches the three straight lines shown in Fig. 4.4. Further study of this
figure reveals another important principle: N-modular redundancy is only
superior to a single system in the high-reliability region. To be more specific,
N-modular redundancy is superior to a single element for $\lambda t < 0.69$; thus,
in system design, one must carefully evaluate the values of reliability obtained
over the range 0 < t < maximum mission time for various values of n and $\lambda$.
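Equation (4.17) is straightforward to evaluate directly. The sketch below is illustrative only (the function names are not from the text). It computes the majority-voting reliability for 2n + 1 elements with a perfect voter and can be used to confirm that more redundancy helps only while p exceeds 0.5, that is, for failure-rate-time products below about 0.69.

```python
import math

def nmr_reliability(p, n):
    """Eq. (4.17): probability that at least n+1 of 2n+1 elements work,
    each with reliability p, assuming a perfect voter."""
    total = 2 * n + 1
    return sum(math.comb(total, i) * p**i * (1 - p)**(total - i)
               for i in range(n + 1, total + 1))

def nmr_reliability_t(lam_t, n):
    """Same expression with p = exp(-lambda*t); lam_t is the product lambda*t."""
    return nmr_reliability(math.exp(-lam_t), n)

if __name__ == "__main__":
    for lam_t in (0.1, 0.3, 0.69, 1.0):
        row = "  ".join(f"n={n}: {nmr_reliability_t(lam_t, n):.4f}"
                        for n in (0, 1, 2, 4))
        print(f"lambda*t={lam_t:<5} {row}")
```

Here n = 0, 1, 2, and 4 correspond to one, three, five, and nine elements.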

Note that in the foregoing analysis, we assumed a perfect voter, that is,
one with a reliability equal to unity. Shortly, we will discard this assumption
and assign a more realistic reliability to voting elements. However, before we
investigate the effect of the voter, it is germane to study the benefits of
partitioning the original system into subsystems and using voting techniques on
the subsystem level.

4.4.3 Subsystem Level Voting 

Figure 4.4 Reliability of a majority voter containing 2n + 1 circuits. (Adapted from
Knox-Seith [1963, p. 12].)

Assume that a digital system is composed of m series subsystems, each having
a constant failure rate, and that voting is to be applied at the subsystem level.
The majority voting circuit is shown in Fig. 4.5. Since this configuration is
composed of just m independent series groups of the same configuration
as previously considered, the reliability is simply given by Eq. (4.17) raised
to the mth power.


$$R = \left[\sum_{i=n+1}^{2n+1} \binom{2n+1}{i} p_{ss}^{i} (1 - p_{ss})^{2n+1-i}\right]^{m} \tag{4.18}$$

where $p_{ss}$ is the subsystem reliability.

The subsystem reliability $p_{ss}$ is, of course, not equal to a fixed value of p; it
instead decays in time. In fact, if we assume that all subsystems are identical
and have constant hazard (failure) rates, and if the system failure rate is
$\lambda$, the subsystem failure rate would be $\lambda/m$, and $p_{ss} = e^{-\lambda t/m}$.
Substitution of the time-dependent expression ($p_{ss} = e^{-\lambda t/m}$) into
Eq. (4.18) yields the time-dependent expression for R(t).

Numerical computations of the system reliability functions for several
values of m and n appear in Fig. 4.6. Knox-Seith [1963] notes that as
$n \to \infty$, the MTTF $\approx 0.7m/\lambda$. This is a direct consequence of the
limiting behavior of Eq. (4.17), as was discussed previously.

To use Eq. (4.18) in design, one chooses values of n and m that yield a
value of R that meets the design goals. If there is a choice of values (n,
m) that yield the desired reliability, one would choose the pair that represents
the lowest cost system. The subject of optimizing voter systems is discussed
further in Chapter 7.

Figure 4.5 Component redundancy and majority voting. (The diagram shows m
majority groups; total number of circuits = (2n + 1)m.)
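The design procedure just described can be sketched in a few lines. This is a minimal illustration under stated assumptions: voters are perfect, the cost of a design is taken to be simply its circuit count (2n + 1)m, and the names `subsystem_voting_reliability` and `cheapest_design` are hypothetical helpers rather than anything defined in the text.

```python
import math

def majority_reliability(p, n):
    """Eq. (4.17) for 2n+1 elements of reliability p and a perfect voter."""
    total = 2 * n + 1
    return sum(math.comb(total, i) * p**i * (1 - p)**(total - i)
               for i in range(n + 1, total + 1))

def subsystem_voting_reliability(lam_t, n, m):
    """Eq. (4.18) with p_ss = exp(-lambda*t/m); lam_t is the product lambda*t."""
    p_ss = math.exp(-lam_t / m)
    return majority_reliability(p_ss, n) ** m

def cheapest_design(lam_t, target, max_n=4, max_m=8):
    """Return (circuits, n, m) with the fewest circuits meeting the target
    reliability, or None if no pair in the search range qualifies.
    Cost model (an assumption): cost is proportional to (2n+1)*m."""
    candidates = [((2 * n + 1) * m, n, m)
                  for n in range(max_n + 1)
                  for m in range(1, max_m + 1)
                  if subsystem_voting_reliability(lam_t, n, m) >= target]
    return min(candidates) if candidates else None

if __name__ == "__main__":
    print(subsystem_voting_reliability(0.1, 1, 1))  # plain TMR
    print(subsystem_voting_reliability(0.1, 1, 2))  # TMR on two subsystems
    print(cheapest_design(0.1, 0.98))
```

A realistic cost model would also charge for the m voters and for their unreliability, which the next section takes up.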


4.5 IMPERFECT VOTERS 

4.5.1 Limitations on Voter Reliability 

One of the main reasons for using a voter to implement redundancy in a digital
circuit or system is the ease with which a comparison is made of the digital
signals. In this section, we consider an imperfect voter and compute the effect
that voter failure will have on the system reliability. (The reader should
compare the following analysis with the analogous effect of coupler reliability in
the discussion of parallel redundancy in Section 3.5.)

In the analysis presented so far in this chapter, we have assumed that the
voter itself cannot fail. This is, of course, untrue; in fact, intuition tells us that
if the voter is poor, its unreliability will wipe out the gains of the redundancy
scheme. Returning to the example of Fig. 4.1, the digital circuit reliability will
be called $p_c$, and the voter reliability will be called $p_v$. The system reliability
formerly given by Eq. (4.2) must be modified to yield

$$R = p_v(3p_c^2 - 2p_c^3) = p_v p_c^2 (3 - 2p_c) \tag{4.19}$$


To achieve an overall gain, the voting scheme with the imperfect voter must
be better than a single element, and

$$R > p_c \quad \text{or} \quad \frac{R}{p_c} > 1 \tag{4.20}$$

Obviously, this requires that

$$\frac{R}{p_c} = p_v p_c (3 - 2p_c) > 1 \tag{4.21}$$

Figure 4.7 Plot of function $p_c(3 - 2p_c)$ versus $p_c$.


The minimum value of $p_v$ for reliability improvement can be computed by
setting $p_v p_c (3 - 2p_c) = 1$. A plot of $p_c(3 - 2p_c)$ is given in Fig. 4.7. One can
obtain information on the values of $p_v$ that allow improvement over a single
circuit by studying this equation. To begin with, we know that since $p_v$ is a
probability, $0 < p_v < 1$. Furthermore, a study of Fig. 4.3 (lower curve) and
Fig. 4.4 (note that $e^{-0.69} = 0.5$) reminds us that N-modular redundancy is only
beneficial if $0.5 < p_c < 1$. Examining Fig. 4.7, we see that the minimum value of
$p_v$ will be obtained when the expression $p_c(3 - 2p_c) = 3p_c - 2p_c^2$ is a
maximum. Differentiating with respect to $p_c$ and equating to zero yields
$p_c = 3/4$, which agrees with Fig. 4.7. Substituting this value of $p_c$ into
$[p_v p_c (3 - 2p_c) = 1]$ yields $p_v = 8/9 = 0.889$, which is the reciprocal of the
maximum of Fig. 4.7. (For additional details concerning voter reliability, see
Siewiorek [1992, pp. 140-141].) This result has been generalized by Grisamone
[1963] for N-voter redundancy, and the results are shown in Table 4.1. This table
provides lower bounds on voter reliability that are useful during design;
however, most voters have a much higher reliability. The main objective is to
make $p_v$ close enough to unity by using reliable components, by derating, and
by exercising conservative design so that the voter reliability has only a
negligible effect on the value of R given in Eq. (4.19).
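The bound for TMR can be reproduced numerically. The sketch below is an illustration, not the method used to build Table 4.1: it searches a grid of circuit reliabilities for the maximum of R/p_c with a perfect voter and returns the reciprocal. The function name and the grid search are assumptions made for this example. For larger N the same simple model gives values close to, but not necessarily identical with, the tabulated ones, since the assumptions behind the table are not restated here.

```python
import math

def majority_reliability(p, n):
    """Eq. (4.17): majority of 2n+1 circuits of reliability p, perfect voter."""
    total = 2 * n + 1
    return sum(math.comb(total, i) * p**i * (1 - p)**(total - i)
               for i in range(n + 1, total + 1))

def min_voter_reliability(n, steps=20000):
    """Smallest p_v for which p_v * R(p_c) can exceed p_c for some p_c.
    Found as 1 / max(R(p_c)/p_c) by grid search over 0 < p_c <= 1."""
    best_gain = max(majority_reliability(k / steps, n) / (k / steps)
                    for k in range(1, steps + 1))
    return 1.0 / best_gain

if __name__ == "__main__":
    print(f"TMR (n=1): {min_voter_reliability(1):.4f}")  # analytic value is 8/9
    print(f"n=2:       {min_voter_reliability(2):.4f}")
```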

4.5.2 Use of Redundant Voters 

In some cases, it is not possible to devise individual voters that have a high
enough reliability to meet the requirements of an ultrareliable system. Since the
voter reliability multiplies the N-modular redundancy reliability, as illustrated
in Eq. (4.19), the system reliability can never exceed that of the voter. If voting
is done at the component level, as shown in Fig. 4.5, the situation is even
worse: the reliability function in Eq. (4.18) is multiplied by $p_v^m$, which can
significantly lower the reliability of the N-modular redundancy scheme. In such
cases, one should consider the possibility of using redundant voters.

TABLE 4.1 Minimum Voter Reliability

Number of redundant circuits, 2n + 1    3      5      7      9      11
Minimum voter reliability, p_v          0.889  0.837  0.807  0.789  0.777  0.75

The standard TMR configuration including redundant voters is shown in Fig.
4.8. Note that Fig. 4.8 depicts a system composed of n subsystems with a triple
of subsystems A, B, and C and a triple of voters V, V', V''. Also, in the last stage
of voting, only a single voter can be employed. One interesting property of the
circuit in Fig. 4.8 is that errors do not propagate more than one stage. If we assume
that subsystems $A_1$, $B_1$, and $C_1$ are all operating properly and that their outputs
should be one, then the outputs of the triplicated voters $V_1$ should also all be one.
Say that one circuit, $B_1$, has failed, yielding a zero output; then, each of the three
voters $V_1$, $V_1'$, $V_1''$ will agree with the majority ($A_1 = C_1 = 1$) and have a unity
output, and the single error does not show up at the output of any voter. In the case
of voter failure, say that voter $V_1''$ fails and yields an erroneous output of zero.
Circuits $A_2$ and $B_2$ will have the correct inputs and outputs, and $C_2$ will have an
incorrect output since it has an incorrect input. However, the next stage of voters
will have two correct inputs from $A_2$ and $B_2$, and these will outvote the erroneous
output caused by $V_1''$; thus, voters $V_2$, $V_2'$, and $V_2''$ will all have the correct
output. One can say that single circuit errors do not propagate at all and that
single voter errors only propagate for one stage.
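The propagation argument can be traced with a small simulation. The sketch below rests on simplifying assumptions that are not in the text: every good circuit simply passes its input through, a faulty unit inverts its output, and every stage (including the last) has three voters. The function `run_stages` and its arguments are hypothetical names chosen for this illustration.

```python
def majority(a, b, c):
    """Two-out-of-three majority of single bits."""
    return (a & b) | (a & c) | (b & c)

def run_stages(n_stages, signal, bad_circuits=(), bad_voters=()):
    """Return the voter-output triple (V, V', V'') for each stage.

    bad_circuits and bad_voters hold (stage, position) pairs with stage
    counted from 1 and position 0, 1, 2 for A/B/C or V/V'/V''.
    A faulty unit inverts the value it should produce.
    """
    inputs = (signal, signal, signal)
    history = []
    for stage in range(1, n_stages + 1):
        # Circuit k is driven by voter k of the previous stage.
        circuits = tuple(x ^ int((stage, k) in bad_circuits)
                         for k, x in enumerate(inputs))
        vote = majority(*circuits)
        voters = tuple(vote ^ int((stage, k) in bad_voters) for k in range(3))
        history.append(voters)
        inputs = voters
    return history

if __name__ == "__main__":
    print(run_stages(3, 1, bad_circuits={(1, 1)}))  # B1 fails
    print(run_stages(3, 1, bad_voters={(1, 2)}))    # V1'' fails
```

In the first case the faulty circuit is outvoted immediately; in the second the bad voter output corrupts one circuit of stage 2 and is then outvoted.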

The reliability expressions for the system of Fig. 4.8 and other similar
arrangements are more complex and depend on which of the following
assumptions (or combination of assumptions) is true:

1. All circuits $A_i$, $B_i$, and $C_i$ and voters $V_i$ are independent circuits or
independent integrated circuit chips.

2. All circuits $A_i$, $B_i$, and $C_i$ are independent circuits or independent
integrated circuit chips, and voters $V_i$, $V_i'$, and $V_i''$ are all on the same chip.

3. All voters $V_i$, $V_i'$, and $V_i''$ are independent circuits or independent
integrated circuit chips, and circuits $A_i$, $B_i$, and $C_i$ are all on the same chip.

4. All circuits $A_i$, $B_i$, and $C_i$ are all on the same chip, and voters $V_i$, $V_i'$,
and $V_i''$ are all on the same chip.

5. All circuits $A_i$, $B_i$, and $C_i$ and voters $V_i$, $V_i'$, and $V_i''$ are on one large
chip.

Reliability expressions for some of these different assumptions are developed
in the problems at the end of this chapter.

Figure 4.8 A TMR circuit with redundant voters.

4.5.3 Modeling Limitations 

The emphasis of this book up to this point has been on analytical models for 
predicting the reliability of various digital systems. Although this viewpoint 
will also prevail for the remainder of the text, there are limitations. This section 
will briefly discuss a few situations that limit the accuracy of analytical models. 

The following situations can be viewed as effects that are difficult to model 
analytically, that lead to pessimistic results from analytical models, and that 
represent cases in which the methods of Appendix D would be warranted. 

1. Some of the failures in digital (and analog) systems are transient in nature
[compare the rationale behind adaptive voting; see Eq. (4.63)]. A transient
failure only occurs over a brief period of time or following certain
triggering events. Thus the equipment may or may not be operating at
any point in time. The analysis associated with the upper curve in Fig.
4.2 took such effects into account.

2. Sometimes, the resulting output of a TMR circuit is correct even if there 
are two failures. Suppose that all three circuits compute one bit, that unit 
two is good, unit one has failed s-a-1, and that unit three has failed s-a- 
0. If the correct output should be a one, then the good unit produces a 
one output that votes along with the failed unit one, producing a correct 
voter output. Similarly, if zero were the correct output, unit three would 
vote with the good unit, producing a correct voter output. 

3. Suppose that the circuit in question produces a 4-bit binary word and that 
circuit one is working properly and produces the 4-bit word 0110. If the 
first bit of circuit two is bad, we obtain 1110; if the last bit of circuit three 
is bad, we obtain 0111. Thus, if we vote on the three complete words, 
then no two agree, but if we vote on the outputs one bit at a time, we 
get the correct results for all bits. 
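The third situation can be illustrated directly with the words quoted above. This is a small sketch with illustrative function names; words are held as bit strings for readability.

```python
from collections import Counter

def vote_words(words):
    """Majority vote on complete words; None if no word has a strict majority."""
    word, count = Counter(words).most_common(1)[0]
    return word if count > len(words) // 2 else None

def vote_bitwise(words):
    """Majority vote taken separately on each bit position."""
    n = len(words)
    width = len(words[0])
    return "".join(
        "1" if sum(w[i] == "1" for w in words) > n // 2 else "0"
        for i in range(width)
    )

if __name__ == "__main__":
    outputs = ["0110", "1110", "0111"]  # circuit one good; two and three each have one bad bit
    print(vote_words(outputs))    # no two words agree
    print(vote_bitwise(outputs))  # each bit position still has a 2-of-3 majority
```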

The more complex fault-tolerant computer programs discussed in Appendix
D allow many of these features, as well as other, more complex issues, to be
modeled.

TABLE 4.2 A Truth Table for a Three-Input Majority Voter

        Inputs               Output
  x_1    x_2    x_3     f_v(x_1 x_2 x_3)
   0      0      0            0            Two
   0      0      1            0            or
   0      1      0            0            three
   1      0      0            0            zeroes
   1      1      0            1            Two
   1      0      1            1            or
   0      1      1            1            three
   1      1      1            1            ones


4.6 VOTER LOGIC 
4.6.1 Voting 

It is useful to discuss the structure of a majority logic voter. This allows the 
designer to appreciate the complexity of a voter and to judge when majority 
voter techniques are appropriate. The structure of a voter is easy to realize 
in terms of logic gates and also through the use of other digital logic-design 
techniques [Shiva, 1988; Wakerly, 1994]. The basic logic function for a TMR 
voter is based on the Truth Table given in Table 4.2, which leads to the simple 
Karnaugh map shown in Table 4.3. 

A direct approach to designing a majority voter is to include a term for
all the minterms in Table 4.2, that is, the last four rows corresponding to an
output of one. The logic circuit would require four three-input AND gates, a
four-input OR gate, and three inverters (NOT gates) for each bit.

$$f_v(x_1 x_2 x_3) = \bar{x}_1 x_2 x_3 + x_1 \bar{x}_2 x_3 + x_1 x_2 \bar{x}_3 + x_1 x_2 x_3 \tag{4.22}$$


TABLE 4.3 Karnaugh Map for a TMR Voter

                 x_2 x_3
  x_1      00    01    11    10
   0        0     0     1     0
   1        0     1     1     1

TABLE 4.4 Minterm Simplification for Table 4.3

The minterm simplification for the TMR voter is shown in Table 4.4 and
yields the voter logic function given in Eq. (4.23):

$$f_v(x_1 x_2 x_3) = x_1 x_2 + x_1 x_3 + x_2 x_3 \tag{4.23}$$

Such a circuit is easy to realize with basic logic gates as shown in Fig. 4.9(a),
where three AND gates plus one OR gate are used, and in Fig. 4.9(b), where four
NAND gates are used. The voter in Fig. 4.9(b) can be seen as equivalent to
that in Fig. 4.9(a) if one examines the output and applies DeMorgan's theorem:

$$f_v(x_1 x_2 x_3) = \overline{\overline{(x_1 x_2)} \cdot \overline{(x_1 x_3)} \cdot \overline{(x_2 x_3)}} = x_1 x_2 + x_1 x_3 + x_2 x_3 \tag{4.24}$$

Figure 4.9 Two circuit realizations of a TMR voter: (a) a voter constructed from
AND/OR gates; and (b) a voter constructed from NAND gates. (In each diagram,
the system inputs (0,1) feed digital circuits A, B, and C, whose outputs drive
the voter that produces the system output (0,1).)
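The equivalence of the different voter realizations can be confirmed by exhaustive enumeration of the eight input combinations. The sketch below is illustrative; the function names are not from the text. It compares the truth-table definition, the canonical sum of the four minterms with output one in Table 4.2, the simplified AND/OR form of Eq. (4.23), and the all-NAND form of Eq. (4.24).

```python
from itertools import product

def voter_truth_table(x1, x2, x3):
    """Definition: output is one when two or three inputs are one."""
    return 1 if x1 + x2 + x3 >= 2 else 0

def voter_minterms(x1, x2, x3):
    """Canonical sum of the four minterms with output one in Table 4.2."""
    n1, n2, n3 = 1 - x1, 1 - x2, 1 - x3
    return (n1 & x2 & x3) | (x1 & n2 & x3) | (x1 & x2 & n3) | (x1 & x2 & x3)

def voter_and_or(x1, x2, x3):
    """Simplified form, Eq. (4.23): three AND gates and one OR gate."""
    return (x1 & x2) | (x1 & x3) | (x2 & x3)

def nand(*inputs):
    """NAND gate with any number of single-bit inputs."""
    result = 1
    for bit in inputs:
        result &= bit
    return 1 - result

def voter_nand(x1, x2, x3):
    """All-NAND form, Eq. (4.24): three two-input NANDs and one three-input NAND."""
    return nand(nand(x1, x2), nand(x1, x3), nand(x2, x3))

if __name__ == "__main__":
    for bits in product((0, 1), repeat=3):
        print(bits, voter_truth_table(*bits), voter_minterms(*bits),
              voter_and_or(*bits), voter_nand(*bits))
```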


4.6.2 Voting and Error Detection 

There are many reasons why it is important to know which circuit has failed
when N-modular redundancy is employed, such as the following:

1. If a panel with light-emitting diodes (LEDs) indicates circuit failures, the 
operator has a warning about which circuits are operative and can initiate 
replacement or repair of the failed circuit. This eliminates much of the 
need for off-line testing. 

2. The operator can take the failure information into account in making a 
decision. 

3. The operator can automatically lock out a failed circuit. 

4. If spare circuits are available, they can be powered up and switched in 
to replace a failed component. 

If one compares the voter inputs the first time that a circuit disagrees with
the majority, a failure warning can be initiated along with any automatic action.
We can illustrate this by deriving the logic circuits that would be obtained
for a TMR system. If we let $f_v(x_1 x_2 x_3)$ represent the voter output as before
and $f_{e1}(x_1 x_2 x_3)$, $f_{e2}(x_1 x_2 x_3)$, and $f_{e3}(x_1 x_2 x_3)$ represent the signals that
indicate errors in circuits one, two, and three, respectively, then the truth table
shown in Table 4.5 holds.
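A truth table of this kind can be generated mechanically. The sketch below assumes, as the surrounding text implies, that an error signal is one exactly when the corresponding circuit output disagrees with the majority; the function names are illustrative.

```python
from itertools import product

def majority(x1, x2, x3):
    """Voter output f_v: two-out-of-three majority."""
    return (x1 & x2) | (x1 & x3) | (x2 & x3)

def tmr_with_error_flags(x1, x2, x3):
    """Return (f_v, f_e1, f_e2, f_e3).
    Assumption: f_ei = 1 when input x_i differs from the majority output."""
    fv = majority(x1, x2, x3)
    return fv, x1 ^ fv, x2 ^ fv, x3 ^ fv

def truth_table():
    """All eight input rows paired with the four outputs."""
    return [(bits, tmr_with_error_flags(*bits))
            for bits in product((0, 1), repeat=3)]

if __name__ == "__main__":
    print("x1 x2 x3 | fv fe1 fe2 fe3")
    for (x1, x2, x3), (fv, e1, e2, e3) in truth_table():
        print(f" {x1}  {x2}  {x3} |  {fv}   {e1}   {e2}   {e3}")
```

With three binary inputs at most one circuit can disagree with the majority, so no row raises more than one error flag.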

TABLE 4.5 Truth Table for a TMR Voter Including Error-Detection Outputs
(inputs $x_1$, $x_2$, $x_3$; outputs $f_v$, $f_{e1}$, $f_{e2}$, $f_{e3}$)

A simple logic realization of these 4 outputs using NAND gates is shown in
Fig. 4.10. The reader should realize that this circuit, with 13 NAND gates and 3
inverters, is only for a single bit output. For a 32-bit computer word, the circuit
will have 96 inverters and 416 NAND gates. In Appendix B, Fig. B7, we show
that the integrated circuit failure rate, $\lambda$, is roughly proportional to the square
root of the number of gates, $\lambda \sim \sqrt{g}$, and for our example,
$\lambda \sim \sqrt{512} = 22.6$. If we assume that the circuit on which we are voting
should have 10 times the failure rate of the voter, the circuit would have 51,076
or about 50,000 gates. The implication of this computation is clear: One should
not employ voters to improve the reliability of small circuits because the voter
reliability may wipe out most of the intended improvement. Clearly, it would
also be wise to consult an experienced logic circuit designer to see if the
512-gate circuit just discussed could be simplified by using other technology,
semicustom gate circuits, available microelectronic chips, and so forth.

Figure 4.10 Circuit that realizes the four switching functions given in Table 4.5 for
a TMR majority voter and error detector.
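The gate-count arithmetic above is easy to retrace. The short sketch below uses illustrative variable names and takes the square-root scaling of failure rate with gate count as given. Note that 51,076 comes from squaring the rounded value 226; without intermediate rounding the figure is 51,200, and both round to about 50,000 gates.

```python
import math

NAND_PER_BIT = 13      # NAND gates in the one-bit voter/error detector
INVERTERS_PER_BIT = 3  # inverters in the one-bit circuit
WORD_BITS = 32

# Gate count of the voter for a full word.
voter_gates = (NAND_PER_BIT + INVERTERS_PER_BIT) * WORD_BITS

# Relative failure rate, assuming lambda is proportional to sqrt(gates).
voter_rate = math.sqrt(voter_gates)

# A voted circuit with 10 times the voter failure rate.
circuit_rate_rounded = 10 * round(voter_rate, 1)
circuit_gates_rounded = circuit_rate_rounded ** 2  # with rounding to 22.6
circuit_gates_exact = (10 * voter_rate) ** 2       # without rounding

if __name__ == "__main__":
    print(voter_gates, round(voter_rate, 1))
    print(round(circuit_gates_rounded), round(circuit_gates_exact))
```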

The circuit given in Fig. 4.10 could also be used to solve the chip test
problem mentioned in Section 4.4.1. If the entire circuit of Fig. 4.10 were on a
single IC, the outputs “circuit A, B, C bad” would allow initial testing and
subsequent monitoring of the IC.


4.7 N-MODULAR REDUNDANCY WITH REPAIR

4.7.1 Introduction 

In Chapter 3, we argued that as long as the operating system possesses
redundancy, the addition of repair raises the reliability. One might ask at the
outset why N-modular redundancy should be used with repair when ordinary
parallel or standby redundancy with repair is very effective in achieving highly
reliable and available systems. The answer to this question involves the coupling
device reliability that was explored in Chapter 3. To be specific, suppose that
we wish to compare the reliability of two parallel systems with that of a TMR
system. Both systems fail if two of the elements fail, but in the TMR case,
there are three systems that could fail; thus the probability of failure is higher.
However, in general, the coupler in a parallel system will be more complex
than a TMR voter, so a comparison of the two designs requires a detailed
evaluation of coupler versus voter reliability. Analysis of TMR system reliability
and availability can be found in Siewiorek [1992, p. 335] and in Toy [1987].

4.7.2 Reliability Computations 

One might expect that it would be most efficient to seek a general solution
for the reliability and availability of a system with N-modular redundancy and
repair, then specify that N = 3 for a TMR system, N = 5 for 5-level voting, and
so on. A moment’s thought, however, suggests quite a different approach. The
conventional solution for the reliability and availability of a system with repair
involves making a Markov model and solving it much as was done in Chapter
3. In the process, the Laplace transform was computed, and a partial fraction
expansion was used to find the individual exponential terms in the solution. For
the case of repair, in general the repair rates couple the n states, and solution
of the set of n first-order differential equations leads to the solution of an
nth-order differential equation. If one applies Laplace transform theory, solution
of the nth-order differential equation is “transformed into” a simpler sequence
of steps. However, one step involves the solution for the roots of an nth-order
polynomial.

Unfortunately, closed-form solutions exist only for first- through
fourth-order polynomials, and solution procedures for cubic and quartic
polynomials are lengthy and seldom used. We learned in high-school algebra the
formula for the roots of a quadratic equation (polynomial). A somewhat more
complex solution exists for the solution of a cubic, which is listed in various
handbooks [Iyanaga, p. 1396], and also for a fourth-order equation [Iyanaga,
p. 1396].

A brief historical note about the origin of closed-form solutions is of interest.
The formula for the third-order equation is generally attributed to Girolamo
Cardano (also known as Jerome Cardan) [Cardano, 1545; Cardan, 1963];
however, he obtained the solution from Nicolo Tartaglia, and apparently it was
discovered by Scipio Ferreo in circa 1505 [Hall, 1957, pp. 480-481]. Ludovico
Ferrari, a pupil of Cardan, developed the formula for the fourth-order equation.
Niels Henrik Abel developed a proof that no closed-form solution exists for
$n \geq 5$ [Iyanaga, p. 1].

The conclusion from the foregoing information on polynomial roots is that
we should start with TMR and other simpler systems if we wish to use
algebraic solutions. Numerical solutions are always possible for higher-order
equations, and the mathematical software discussed in Appendix D expedites such
an approach; however, the insight of an analytical solution is generally lacking.
Another approach is to use simplifications and approximations such as those
discussed in Appendix B (Sections B8.2 and B8.3). We will use the tried and
true three-step engineering approach:

1. Represent the main features of the system by a low-order model that is 
amenable to closed-form solution. 

2. Add further effects one at a time that complicate the model; study the 
effect (if necessary, use simplifying assumptions and approximations or 
numerical results computed over a range of parameters). 

3. Put all the effects into a comprehensive model and solve numerically. 

Our development begins by studying the reliability and availability of a 
TMR system, assuming that the design is truly TMR or that we are using a 
TMR model as step one in our solution approach. 

4.7.3 TMR Reliability 

Markov Model. We begin the analysis of voting systems with repair by
analyzing the reliability of a TMR system. The Markov reliability diagram for a
TMR system composed of a voter, V, and three digital subsystems $x_1$, $x_2$, and
$x_3$ is given in Fig. 4.11. It is assumed that the xs are identical and have the
same failure rate, $\lambda$, and that the voter does not fail.

If we compare Fig. 4.11 with the model given in Fig. 3.14 of Chapter 3,
we see that they are essentially the same, only with different parameter values
(transition rates). There are three states in both models: repair occurs from
state $s_1$ to $s_0$, and state $s_2$ is an absorbing state. (Actually, a complete model
for Fig. 4.11 would have a fourth state, $s_3$, which is reached by an additional
failure from state $s_2$. However, we have included both states in state $s_2$ since
either two or three failures both represent system failure. As a rule, it is almost
always easier to use a Markov model with fewer states even if one or more of
the states represent combined states. State $s_2$ is actually a combined state, also
known as a merged state, and a complete discussion of the rules for merging
appears in Shooman [1990, p. 529]. One could decompose the third state in
Fig. 4.11 into $s_2 = \bar{x}_1 \bar{x}_2 x_3 + \bar{x}_1 x_2 \bar{x}_3 + x_1 \bar{x}_2 \bar{x}_3$ and
$s_3 = \bar{x}_1 \bar{x}_2 \bar{x}_3$ by reformulating the model as a more complex four-state
model. However, the four-state model is not needed to solve for the upstate
probabilities $P_{s_0}$ and $P_{s_1}$. Thus the simpler three-state model of Fig. 4.11
will be used.)

Figure 4.11 A Markov reliability model for a TMR system with repair. (States:
$s_0 = x_1 x_2 x_3$, no failures; $s_1 = \bar{x}_1 x_2 x_3 + x_1 \bar{x}_2 x_3 + x_1 x_2 \bar{x}_3$,
one failure; $s_2$, two or three failures. The probabilities of remaining in
$s_0$, $s_1$, and $s_2$ during $\Delta t$ are $1 - 3\lambda \Delta t$,
$1 - (2\lambda + \mu)\Delta t$, and 1, respectively.)


In the TMR model of Fig. 4.11, there are three ways to experience a single
failure from $s_0$ to $s_1$ and two ways for failures to move the system state from
$s_1$ to $s_2$. Figure 3.14 of Chapter 3 uses failure rates of $\lambda'$ and $\lambda$ in the
model; by substituting appropriate values, the model could hold for two parallel
elements or for one on-line and one standby element. One can save repeating a
lot of analysis and solution by realizing that the solution given in Eqs.
(3.62)-(3.66) will also hold for the model of Fig. 4.11 if we replace $\lambda'$ by
$3\lambda$ (three ways to go from state $s_0$ to state $s_1$), $\lambda$ by $2\lambda$ (two ways to go
from state $s_1$ to state $s_2$), and $\mu'$ by $\mu$ (single repairman in both cases).
Substituting these values in Eqs. (3.65) yields

$$P_{s_0}(s) = \frac{s + 2\lambda + \mu}{s^2 + (5\lambda + \mu)s + 6\lambda^2} \tag{4.25a}$$

$$P_{s_1}(s) = \frac{3\lambda}{s^2 + (5\lambda + \mu)s + 6\lambda^2} \tag{4.25b}$$

$$P_{s_2}(s) = \frac{6\lambda^2}{s[s^2 + (5\lambda + \mu)s + 6\lambda^2]} \tag{4.25c}$$


Note that as a check, we sum Eqs. (4.25a-c) and obtain the value 1/s, which 
is the transform of unity. Thus the three equations sum to 1, as they should. 

One can add the equations for $P_{s_0}$ and $P_{s_1}$ to obtain the reliability of a TMR
system with repair in the transform domain.

$$R_{TMR}(s) = \frac{s + 5\lambda + \mu}{s^2 + (5\lambda + \mu)s + 6\lambda^2} \tag{4.26a}$$


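When $\mu = 0$, the denominator polynomial factors into $(s + 2\lambda)$ and
$(s + 3\lambda)$. For $\mu > 0$, it factors into $(s - r_1)$ and $(s - r_2)$, where
$r_{1,2} = \left[-(5\lambda + \mu) \pm \sqrt{\lambda^2 + 10\lambda\mu + \mu^2}\right]/2$ are both
real and negative, and partial fraction expansion yields

$$R_{TMR}(s) = \frac{1}{r_1 - r_2}\left[\frac{r_1}{s - r_2} - \frac{r_2}{s - r_1}\right] \tag{4.26b}$$

Using transform #4 in Table B6 in Appendix B, we obtain the time function:

$$R_{TMR}(t) = \frac{r_1 e^{r_2 t} - r_2 e^{r_1 t}}{r_1 - r_2} \tag{4.26c}$$

One can check the above result by letting $\mu = 0$ (no repair), for which
$r_1 = -2\lambda$ and $r_2 = -3\lambda$. This yields
$R_{TMR}(t) = 3e^{-2\lambda t} - 2e^{-3\lambda t}$, and if $p = e^{-\lambda t}$, this becomes
$R_{TMR} = 3p^2 - 2p^3$, which of course agrees with the result previously computed
[see Eq. (4.2)].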
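A numerical cross-check is useful here because the transform algebra is easy to get wrong. The sketch below is an illustration with hypothetical function names. It integrates the state equations of the three-state model of Fig. 4.11 directly with a fourth-order Runge-Kutta step and compares the result with the inverse transform of Eq. (4.26a), written in terms of the two roots of its denominator, $s^2 + (5\lambda + \mu)s + 6\lambda^2$.

```python
import math

def tmr_repair_closed_form(lam, mu, t):
    """Inverse transform of Eq. (4.26a) via the roots r1, r2 of
    s^2 + (5*lam + mu)*s + 6*lam^2 (both real and negative)."""
    b = 5 * lam + mu
    disc = math.sqrt(b * b - 24 * lam * lam)  # = sqrt(lam^2 + 10*lam*mu + mu^2)
    r1 = (-b + disc) / 2
    r2 = (-b - disc) / 2
    return (r1 * math.exp(r2 * t) - r2 * math.exp(r1 * t)) / (r1 - r2)

def tmr_repair_numeric(lam, mu, t, steps=2000):
    """Integrate the Markov equations of Fig. 4.11 with RK4:
       dP0/dt = -3*lam*P0 + mu*P1
       dP1/dt =  3*lam*P0 - (2*lam + mu)*P1
    and return R(t) = P0 + P1."""
    def deriv(p0, p1):
        return (-3 * lam * p0 + mu * p1,
                3 * lam * p0 - (2 * lam + mu) * p1)

    h = t / steps
    p0, p1 = 1.0, 0.0  # system starts with all three units good
    for _ in range(steps):
        k1 = deriv(p0, p1)
        k2 = deriv(p0 + h / 2 * k1[0], p1 + h / 2 * k1[1])
        k3 = deriv(p0 + h / 2 * k2[0], p1 + h / 2 * k2[1])
        k4 = deriv(p0 + h * k3[0], p1 + h * k3[1])
        p0 += h / 6 * (k1[0] + 2 * k2[0] + 2 * k3[0] + k4[0])
        p1 += h / 6 * (k1[1] + 2 * k2[1] + 2 * k3[1] + k4[1])
    return p0 + p1

if __name__ == "__main__":
    lam, t = 1.0, 0.1
    for mu in (0.0, 10.0):
        print(mu, tmr_repair_closed_form(lam, mu, t),
              tmr_repair_numeric(lam, mu, t))
```

Both routes should agree to many decimal places, and with no repair they reduce to the familiar TMR expression.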


Initial Behavior. The complete solution for the reliability of a TMR system
with repair is given in Eq. (4.26c). It is useful to practice with the simplifying
effects of initial behavior, final behavior, and MTTF solutions on this simple
problem before they are applied later in this chapter to more complex models
where the simplification is needed. One can evaluate the effects of repair on
the initial behavior of the TMR system simply by using the transform for $t^n$,
which is discussed in Appendix B, Section B8.3. We begin with Eq. (4.26a),
where division of the denominator into the numerator using polynomial long
division yields for the first three terms:

$$R_{TMR}(s) = \frac{1}{s} - \frac{6\lambda^2}{s^3} + \frac{6\lambda^2(5\lambda + \mu)}{s^4} + \cdots \tag{4.27a}$$

Using inverse transform no. 5 of Table B6 of Appendix B yields

$$\mathcal{L}\left\{\frac{t^{n-1}}{(n-1)!} e^{-at}\right\} = \frac{1}{(s + a)^n} \tag{4.27b}$$

Setting a = 0 yields

$$\mathcal{L}\left\{\frac{t^{n-1}}{(n-1)!}\right\} = \frac{1}{s^n} \tag{4.27c}$$

Using the transform in Eq. (4.27c) converts Eq. (4.27a) into the time function,
which is a three-term polynomial in t (the first three terms in the Taylor series
expansion of the time function).

$$R_{TMR}(t) = 1 - 3\lambda^2 t^2 + \lambda^2(5\lambda + \mu)t^3 + \cdots \tag{4.27d}$$

We previously studied the first two terms in the Taylor series expansion of
the TMR reliability expansion in Eq. (4.15). In Eq. (4.27d), we have a
three-term solution, and one can compare Eqs. (4.15) and (4.27d) by calculating an
additional third term in the expansion of Eq. (4.15). The expansions in Eq.
(4.15) are augmented by including the cubic terms in the expansions of the
bracketed terms, that is, $-4\lambda^3 t^3/3$ in the first bracket and $+\lambda^3 t^3/3$ in the
second bracket. Carrying out the algebra adds a third term, and Eq. (4.15)
becomes expanded as follows:

$$R_{TMR(3\text{-}2)} = 1 - 3\lambda^2 t^2 + 5\lambda^3 t^3 \tag{4.27e}$$

Thus the first three terms of Eq. (4.15) and Eq. (4.27d) are identical for the
case of no repair, $\mu = 0$. Equation (4.27d) is larger (closer to unity) than the
expanded version of Eq. (4.15) because of the additional term $+\lambda^2 \mu t^3$ that is
significant for large values of repair rate; we therefore see that repair improves
the reliability. However, we note that repair only affects the cubic term in Eq.
(4.27d) and not the quadratic term. Thus, for very small t, repair does not
affect the initial behavior; however, from the above solution, we can see that
it is beneficial for small and modest size t.

A numerical example will illustrate the improvement in initial reliability
due to repair. Let $\mu = 10\lambda$; then the third term in Eq. (4.27d) becomes
$+15\lambda^3 t^3$ rather than $+5\lambda^3 t^3$ with no repair. One can evaluate the increase
due to $\mu = 10\lambda$ at one point in time by letting $t = 0.1/\lambda$. At this point in
time, the three-term expansion gives a TMR reliability of 0.975 without repair
and 0.985 with repair. Further comparisons of the effects of repair appear in
the problems at the end of the chapter.
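The numbers in this example follow from the three-term expansion and can be reproduced in a few lines. The sketch below uses illustrative function names and also evaluates the exact no-repair expression for comparison, which shows how close the truncated series is at this point in time.

```python
import math

def tmr_series(lam, mu, t):
    """Three-term expansion of TMR reliability with repair, Eq. (4.27d)."""
    return 1 - 3 * (lam * t) ** 2 + lam**2 * (5 * lam + mu) * t**3

def tmr_exact_no_repair(lam, t):
    """Exact TMR reliability without repair: 3e^{-2*lam*t} - 2e^{-3*lam*t}."""
    return 3 * math.exp(-2 * lam * t) - 2 * math.exp(-3 * lam * t)

if __name__ == "__main__":
    lam = 1.0
    t = 0.1 / lam
    print(f"series, no repair:   {tmr_series(lam, 0.0, t):.4f}")
    print(f"series, mu = 10 lam: {tmr_series(lam, 10 * lam, t):.4f}")
    print(f"exact,  no repair:   {tmr_exact_no_repair(lam, t):.4f}")
```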

The approximate analysis of this section led to a useful evaluation of the 
effects of repair through the computation of the power series expansion of the 
time function for the model with repair. This approximate result avoids the need 
to factor the denominator polynomial in the Laplace transform solution, which 
was found to be a stumbling block in obtaining a complete closed solution for 
higher-order systems. The next section will discuss the mean time to failure 
(MTTF) as another approximate solution that also avoids polynomial factoring. 

Mean Time to Failure. As we saw in the preceding chapter, the computa¬ 
tion of MTTF greatly simplifies the analysis, but it is not without pitfalls. The 
MTTF computes the “area under the reliability curve” (see also Section 3.8.3). 
Thus, for a single element with a reliability function of e Kl . the area under the 
curve yields 1/A; however, the MTTF calculation for the TMR system given 
in Eq. (4.11) yields a value of 5/6X. This implies that a single element is bet¬ 
ter than TMR, but we know that TMR has a higher reliability than a single 
element (see also Siewiorek [1992, p. 294]). The explanation of this apparent 
contradiction is simple if we examine the n = 0 and n = 1 curves in Fig. 4.4. 
In the region of primary interest, 0 < Xr < 0.69, TMR is superior to a single 
element, but in the region 0.69 < Xr < °° (not a region of primary interest), 
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the single element has a superior reliability. Thus, in computing the integral 
between t = 0 and t = °°, the long tail controls the result. The lesson is that 
we should not trust an MTTF comparison without further study unless there is 
a significant superiority or unless the two reliability functions have the same 
shape. Clearly, if the two functions have the same shape, then a comparison 
of the MTTF values should be definitive. Graphing of reliability functions in 
the high-reliability region should always be included in an analysis, especially 
with the ready availability, power, and ease provided by software on a modern 
PC. One can also easily integrate the functions in question by using an analysis 
program to compute MTTF. 
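
As a rough illustration of such a computation, the following Python sketch integrates the two reliability functions numerically. It assumes the standard no-repair TMR expression $3e^{-2\lambda t} - 2e^{-3\lambda t}$ (the form implied by the MTTF of $5/(6\lambda)$ quoted above) and works in the normalized variable $x = \lambda t$; the function names are illustrative only.

```python
import math

def r_single(x):
    """Reliability of one element as a function of x = lambda * t."""
    return math.exp(-x)

def r_tmr(x):
    """Assumed TMR reliability without repair: 3e^{-2x} - 2e^{-3x}."""
    return 3 * math.exp(-2 * x) - 2 * math.exp(-3 * x)

def area(f, upper=40.0, steps=200_000):
    """Trapezoidal estimate of the integral of f from 0 to `upper`."""
    h = upper / steps
    total = 0.5 * (f(0.0) + f(upper))
    for i in range(1, steps):
        total += f(i * h)
    return total * h

mttf_single = area(r_single)   # in units of 1/lambda
mttf_tmr = area(r_tmr)         # in units of 1/lambda
crossover = math.log(2)        # value of lambda*t where the two curves meet

print(f"MTTF single = {mttf_single:.4f}/lambda, MTTF TMR = {mttf_tmr:.4f}/lambda")
print(f"curves cross at lambda*t = {crossover:.3f}")
```

The single element wins on total area even though TMR is better everywhere to the left of the crossover, which is the long-tail effect described above.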

We now apply the simple method given in Appendix B, Section B8.2 to
evaluate the MTTF by letting s approach zero in the Laplace transform of the
reliability function, Eq. (4.26a). The result is

$$\mathrm{MTTF} = \frac{5\lambda + \mu}{6\lambda^2} \tag{4.28}$$

To evaluate the effect of repair, let $\mu = 10\lambda$. The MTTF increases from
$5/(6\lambda)$ without repair to $15/(6\lambda)$ with repair, a threefold improvement.
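
The limit can be checked with exact rational arithmetic. The sketch below assumes that the transform of Eq. (4.26a), which is not reproduced in this section, has the form $(s + 5\lambda + \mu)/(s^2 + (5\lambda + \mu)s + 6\lambda^2)$, as implied by Eq. (4.29); the function name is hypothetical.

```python
from fractions import Fraction

def tmr_transform(s, lam, mu):
    """Assumed R(s) for TMR with repair (form inferred from Eq. (4.29))."""
    return (s + 5 * lam + mu) / (s * s + (5 * lam + mu) * s + 6 * lam * lam)

lam = Fraction(1)                                    # work in units of 1/lambda
mttf_no_repair = tmr_transform(Fraction(0), lam, Fraction(0))
mttf_repair = tmr_transform(Fraction(0), lam, 10 * lam)
improvement = mttf_repair / mttf_no_repair

print(mttf_no_repair, mttf_repair, improvement)
```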


Final Behavior. The Laplace transform has a simple theorem that allows us
to easily calculate the final value of a time function based on its transform.
(See Appendix B, Table B7, Theorem 7.) The final-value theorem states that
the value of the time function $f(t)$ as $t \to \infty$ is given by $sF(s)$ (the transform
multiplied by s) as $s \to 0$. Applying this to Eq. (4.26a), we obtain

$$\lim_{s \to 0} \{ s R_{\mathrm{TMR}}(s) \} = \lim_{s \to 0} \frac{s(s + 5\lambda + \mu)}{s^2 + (5\lambda + \mu)s + 6\lambda^2} = 0 \tag{4.29}$$

A little thought shows that this is the correct result since all reliability functions
go to zero as time increases. However, when we study the availability
function later in this chapter, we will see that the final value of the availability
is nonzero. This value is an important measure of system behavior.


4.7.4 N-Modular Reliability

Having explored the analysis of the reliability of a TMR system with repair,
it would be useful to develop general expressions for the reliability, MTTF,
and initial behavior for N-modular systems. This task is difficult and probably
unnecessary since most practical systems have 3- or 5-level majority voting.
(An intermediate system with 4-level voting used by NASA in the Space Shuttle
will be discussed later in this chapter.) The main focus of this section will
therefore be the analysis of the 5-level case.

Markov Model. We begin the analysis of 5-level modular reliability with
repair by formulating the Markov model given in Fig. 4.12. We follow the same
approach used to formulate the Markov model given in Fig. 4.11. There are,
however, additional states. (Actually, there is one additional state that lumps
together three other states.)

[Figure 4.12 A Markov reliability model for a 5-level majority voting system with
repair. States: $s_0$, zero failures; $s_1$, one failure; $s_2$, two failures; $s_3$, three or
more failures. Self-loop terms: $1 - 5\lambda\Delta t$, $1 - (4\lambda + \mu)\Delta t$, $1 - (3\lambda + \mu)\Delta t$, and 1.]

The Markov time-domain differential equations are written in a manner
analogous to that used in developing Eqs. (3.62a-c). The notation $\dot{P}_s = dP_s/dt$
is used for convenience, and the following equations are obtained:

$$\dot{P}_{s_0}(t) = -5\lambda P_{s_0}(t) + \mu P_{s_1}(t) \tag{4.30a}$$

$$\dot{P}_{s_1}(t) = 5\lambda P_{s_0}(t) - (4\lambda + \mu) P_{s_1}(t) + \mu P_{s_2}(t) \tag{4.30b}$$

$$\dot{P}_{s_2}(t) = 4\lambda P_{s_1}(t) - (3\lambda + \mu) P_{s_2}(t) \tag{4.30c}$$

$$\dot{P}_{s_3}(t) = 3\lambda P_{s_2}(t) \tag{4.30d}$$

Taking the Laplace transform of the preceding equations and incorporating
the initial conditions $P_{s_0}(0) = 1$, $P_{s_1}(0) = P_{s_2}(0) = P_{s_3}(0) = 0$ leads to the
transformed equations as follows:

$$(s + 5\lambda) P_{s_0}(s) - \mu P_{s_1}(s) = 1 \tag{4.31a}$$

$$-5\lambda P_{s_0}(s) + (s + 4\lambda + \mu) P_{s_1}(s) - \mu P_{s_2}(s) = 0 \tag{4.31b}$$

$$-4\lambda P_{s_1}(s) + (s + 3\lambda + \mu) P_{s_2}(s) = 0 \tag{4.31c}$$

$$-3\lambda P_{s_2}(s) + s P_{s_3}(s) = 0 \tag{4.31d}$$

Equations (4.31a-d) can be solved by a variety of means for the probabilities
$P_{s_0}(t)$, $P_{s_1}(t)$, $P_{s_2}(t)$, and $P_{s_3}(t)$. One technique based on Cramer’s rule is
to formulate a set of determinants associated with the equations. Each of the
probabilities becomes a ratio of two of the determinants: a numerator determinant
divided by a denominator determinant. The denominator determinant
is the same for each ratio; it is generally denoted by $\Delta$ and is the determinant
of the coefficients of the equations. (One can develop the form of these equations
in a more elaborate fashion using matrix theory; see Shooman [1990, pp.
239-243].) A brief inspection of Eqs. (4.31a-d) shows that the first three are
uncoupled from the last and can be solved separately, simplifying the algebra
(this will always be true in a Markov model with repair when the last state is
an absorbing one). Thus, for the first three equations,

$$\Delta = \begin{vmatrix} s + 5\lambda & -\mu & 0 \\ -5\lambda & s + 4\lambda + \mu & -\mu \\ 0 & -4\lambda & s + 3\lambda + \mu \end{vmatrix} \tag{4.32}$$

The numerator determinants in the solution are similar to the denominator
determinants; however, one column is replaced by the right-hand side of the
Eqs. (4.31a-d); that is,

$$\Delta_1 = \begin{vmatrix} 1 & -\mu & 0 \\ 0 & s + 4\lambda + \mu & -\mu \\ 0 & -4\lambda & s + 3\lambda + \mu \end{vmatrix} \tag{4.33a}$$

$$\Delta_2 = \begin{vmatrix} s + 5\lambda & 1 & 0 \\ -5\lambda & 0 & -\mu \\ 0 & 0 & s + 3\lambda + \mu \end{vmatrix} \tag{4.33b}$$

$$\Delta_3 = \begin{vmatrix} s + 5\lambda & -\mu & 1 \\ -5\lambda & s + 4\lambda + \mu & 0 \\ 0 & -4\lambda & 0 \end{vmatrix} \tag{4.33c}$$

In terms of this group of determinants, the probabilities are

$$P_{s_0}(s) = \frac{\Delta_1}{\Delta} \tag{4.34a}$$

$$P_{s_1}(s) = \frac{\Delta_2}{\Delta} \tag{4.34b}$$

$$P_{s_2}(s) = \frac{\Delta_3}{\Delta} \tag{4.34c}$$

The reliability of the 5-level modular redundancy system is given by

$$R_{\mathrm{5MR}}(t) = P_{s_0}(t) + P_{s_1}(t) + P_{s_2}(t) \tag{4.35}$$

Expansion of the denominator determinant yields the following polynomial:

$$\Delta = s^3 + (12\lambda + 2\mu)s^2 + (47\lambda^2 + 8\lambda\mu + \mu^2)s + 60\lambda^3 \tag{4.36a}$$

Similarly, expanding the other determinants yields the following polynomials:

$$\Delta_1 = s^2 + (7\lambda + 2\mu)s + 12\lambda^2 + 3\lambda\mu + \mu^2 \tag{4.36b}$$

$$\Delta_2 = 5\lambda(s + 3\lambda + \mu) \tag{4.36c}$$

$$\Delta_3 = 20\lambda^2 \tag{4.36d}$$

Substitution in Eqs. (4.34a-c) and (4.35) yields the transform of the reliability
function:

$$R_{\mathrm{5MR}}(s) = \frac{s^2 + (12\lambda + 2\mu)s + 47\lambda^2 + 8\lambda\mu + \mu^2}{s^3 + (12\lambda + 2\mu)s^2 + (47\lambda^2 + 8\lambda\mu + \mu^2)s + 60\lambda^3} \tag{4.37}$$

As a check, we compute the probability of being in the fourth state $P_{s_3}(s)$ from
Eq. (4.31d) as

$$P_{s_3}(s) = \frac{3\lambda P_{s_2}(s)}{s} = \frac{60\lambda^3}{s\Delta} \tag{4.38}$$

Adding Eq. (4.37) to Eq. (4.38) and performing some algebraic manipulation
yields 1/s, which is the transform of unity. Thus the sum of all the state probabilities
adds to unity as it should and the results check.
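
The Cramer's-rule solution and the closing check can be reproduced numerically. The following Python sketch evaluates the determinants at one arbitrarily chosen sample point ($s = 2$, $\lambda = 1$, $\mu = 10$, all assumptions made only for illustration) using exact fractions, and compares the result with the expanded denominator polynomial and with $1/s$.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve_5mr(s, lam, mu):
    """Return ([P_s0, P_s1, P_s2, P_s3], delta) for the 5MR transforms at s."""
    coeff = [[s + 5 * lam, -mu, 0],
             [-5 * lam, s + 4 * lam + mu, -mu],
             [0, -4 * lam, s + 3 * lam + mu]]
    rhs = [1, 0, 0]
    delta = det3(coeff)
    probs = []
    for col in range(3):
        # Cramer's rule: replace one column by the right-hand side.
        m = [row[:] for row in coeff]
        for r in range(3):
            m[r][col] = rhs[r]
        probs.append(det3(m) / delta)
    # Absorbing state from the fourth transformed equation: s*P_s3 = 3*lam*P_s2.
    probs.append(3 * lam * probs[2] / s)
    return probs, delta

s, lam, mu = Fraction(2), Fraction(1), Fraction(10)   # arbitrary sample point
probs, delta = solve_5mr(s, lam, mu)
delta_poly = (s**3 + (12 * lam + 2 * mu) * s**2
              + (47 * lam**2 + 8 * lam * mu + mu**2) * s + 60 * lam**3)
total = sum(probs)

print(delta, delta_poly, total)
```

Checking at a single point does not prove the polynomial identities, but it is a quick guard against sign or coefficient slips in the hand expansion.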


Initial Behavior. As in the preceding section, we can model the initial behavior
by expanding the transform Eq. (4.37) into a series in inverse powers of s
using polynomial division. The division yields

$$R_{\mathrm{5MR}}(s) = \frac{1}{s} - \frac{60\lambda^3}{s^4} + \frac{60\lambda^3(12\lambda + 2\mu)}{s^5} - \cdots \tag{4.39a}$$

Applying the inverse transform of Eq. (4.27c) yields

$$R_{\mathrm{5MR}}(t) = 1 - 10\lambda^3 t^3 + 2.5\lambda^3(12\lambda + 2\mu)t^4 - \cdots \tag{4.39b}$$

We can compare the gain due to 5-level modular redundancy with repair
to that of TMR with repair by letting $\mu = 10\lambda$ and $t = 0.1/\lambda$, as in Section
4.7.3, which gives a reliability of 0.998. Without repair, the reliability would
be 0.993. These values should be compared with the TMR reliability without
repair, which is equal to 0.975, and TMR with repair, which is 0.985. Since it
is difficult to compare reliabilities close to unity, we can focus on the unreliabilities
with repair. The 5-level voting has an unreliability of 0.002; the TMR,
0.015. Thus, the change in voting from 3-level to 5-level has reduced the unreliability
by a factor of 7.5. Further comparisons of the effects of repair appear
in the problems at the end of this chapter.

TABLE 4.6 Comparison of the MTTF for Several Voting and Parallel
Systems with Repair

System            MTTF Equation                                             $\mu = 0$         $\mu = 10\lambda$   $\mu = 100\lambda$
TMR with repair   $(5\lambda + \mu)/(6\lambda^2)$                           $0.83/\lambda$    $2.5/\lambda$       $17.5/\lambda$
5MR with repair   $(47\lambda^2 + 8\lambda\mu + \mu^2)/(60\lambda^3)$       $0.78/\lambda$    $3.78/\lambda$      $180.78/\lambda$
Two parallel      $(3\lambda + \mu)/(2\lambda^2)$                           $1.5/\lambda$     $6.5/\lambda$       $51.5/\lambda$
Two standby       $(2\lambda + \mu)/\lambda^2$                              $2/\lambda$       $12/\lambda$        $102/\lambda$
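
The numerical comparison of the truncated series at $t = 0.1/\lambda$ can be reproduced with a few lines of Python. The sketch assumes that the TMR series of Eq. (4.27d), which is not reproduced here, is $1 - 3(\lambda t)^2 + (5 + \mu/\lambda)(\lambda t)^3$; this form agrees with the no-repair and $\mu = 10\lambda$ values quoted in the text. Function names are illustrative.

```python
def r_tmr_initial(x, ratio):
    """Assumed form of Eq. (4.27d); x = lambda*t, ratio = mu/lambda."""
    return 1 - 3 * x**2 + (5 + ratio) * x**3

def r_5mr_initial(x, ratio):
    """Truncated series of Eq. (4.39b); x = lambda*t, ratio = mu/lambda."""
    return 1 - 10 * x**3 + 2.5 * (12 + 2 * ratio) * x**4

x = 0.1
tmr_no_repair, tmr_repair = r_tmr_initial(x, 0), r_tmr_initial(x, 10)
fmr_no_repair, fmr_repair = r_5mr_initial(x, 0), r_5mr_initial(x, 10)
# Ratio of unreliabilities with repair, TMR relative to 5MR.
gain = (1 - tmr_repair) / (1 - fmr_repair)

print(tmr_no_repair, tmr_repair, fmr_no_repair, fmr_repair, gain)
```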


Mean Time to Failure Comparison. The MTTF for 5-level voting is easily
computed by letting s approach 0 in the transform equation, which yields

$$\mathrm{MTTF}_{\mathrm{5MR}} = \frac{47\lambda^2 + 8\lambda\mu + \mu^2}{60\lambda^3} \tag{4.40}$$

This MTTF is compared with some other systems in Table 4.6. The table 
shows, as expected, that 5MR is superior to TMR when repair is present. Note 
that two parallel or two standby elements appear more reliable. Once reduction 
in reliability due to the reliability of the coupler and coverage is included and 
compared with the reduction due to the reliability of the voter, this advantage 
may disappear. 
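
The entries of Table 4.6 follow directly from the four MTTF expressions. A minimal Python sketch, with $\lambda$ normalized to 1 so that results are in units of $1/\lambda$ (the dictionary and function names are illustrative):

```python
from fractions import Fraction

# MTTF expressions from Table 4.6 as functions of lambda and mu.
MTTF = {
    "TMR with repair": lambda lam, mu: (5 * lam + mu) / (6 * lam**2),
    "5MR with repair": lambda lam, mu: (47 * lam**2 + 8 * lam * mu + mu**2) / (60 * lam**3),
    "Two parallel":    lambda lam, mu: (3 * lam + mu) / (2 * lam**2),
    "Two standby":     lambda lam, mu: (2 * lam + mu) / lam**2,
}

def mttf_table(ratios=(0, 10, 100)):
    """MTTF in units of 1/lambda for each system and each mu/lambda ratio."""
    lam = Fraction(1)
    return {name: [float(f(lam, r * lam)) for r in ratios]
            for name, f in MTTF.items()}

table = mttf_table()
for name, row in table.items():
    print(f"{name:16s}", "  ".join(f"{v:8.2f}" for v in row))
```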

Initial Behavior Comparison. The initial behavior of the systems given in
Table 4.6 is compared in Table 4.7 using Eqs. (4.27d) and (4.39b) for TMR and
5MR systems. For the case of two ordinary parallel and two standby systems,
we must derive the initial behavior equation by adding Eqs. (3.65a) and (3.65b)
to obtain the transform of the reliability function that holds for both parallel
and standby systems.

$$R(s) = P_{s_0}(s) + P_{s_1}(s) = \frac{s + \lambda + \lambda' + \mu'}{s^2 + (\lambda + \lambda' + \mu')s + \lambda\lambda'} \tag{4.41}$$

For an ordinary parallel system, $\lambda' = 2\lambda$ and $\mu' = \mu$, and substitution into Eq.
(4.41), long division of the denominator into the numerator, and inversion of
the transform (as was done previously) yields

$$R_{\mathrm{parallel}}(t) = 1 - (\lambda t)^2 + \lambda^2(3\lambda + \mu)t^3/3 \tag{4.42a}$$

For a standby system, $\lambda' = \lambda$ and $\mu' = \mu$, and substitution into Eq. (4.41), long
division, and inversion of the transform yields

$$R_{\mathrm{standby}}(t) = 1 - (\lambda t)^2/2 + \lambda^2(2\lambda + \mu)t^3/6 \tag{4.42b}$$

Equations (4.42a) and (4.42b) appear in Table 4.7 along with Eqs. (4.27d) and
(4.39b), where $\mu = 10\lambda$ has been substituted.

TABLE 4.7 Comparison of the Initial Behavior for Several
Voting and Parallel Systems with Repair

System            Initial Reliability Equation, $\mu = 10\lambda$   Value of t at which R = 0.999
TMR with repair   $1 - 3(\lambda t)^2 + 15(\lambda t)^3$            $0.0192/\lambda$
5MR with repair   $1 - 10(\lambda t)^3 + 80(\lambda t)^4$           $0.057/\lambda$
Two parallel      $1 - (\lambda t)^2 + 4.33(\lambda t)^3$           $0.034/\lambda$
Two standby       $1 - 0.5(\lambda t)^2 + 2(\lambda t)^3$           $0.045/\lambda$

Table 4.7 shows the length of time the reliability takes to decay from 1
to 0.999, which is clearly a high-reliability region. For the TMR system,
the duration is $t = 0.0192/\lambda$; for the 5-level voting system, $t = 0.057/\lambda$. Thus the
5-level system represents an increase of nearly 3 over the 3-level system. One
can better appreciate these numerical values if typical values are substituted for
$\lambda$. The length of a year is 8,766 hours, which is often approximated as 10,000
hours. A high-reliability computer may have an MTTF ($1/\lambda$) of about 10 years,
or approximately 100,000 hours. Substituting this value for $1/\lambda$ shows that a
TMR system with a repair rate of 10 times the failure rate will
have a reliability exceeding 0.999 for about 1,920 hours. Similarly, a 5-level
voting system will have a reliability exceeding 0.999 for about 5,700 hours.
In the case of the parallel and standby systems, the high-reliability region is
longer than in a TMR system, but is less than in a 5-level voter system.
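
The decay times for the two voting systems can be found by solving the truncated series for $R = 0.999$. The sketch below uses simple bisection on $x = \lambda t$; it assumes that each series crosses the target level exactly once on the search interval $[0, 0.08]$, and it takes $1/\lambda$ = 100,000 hours as in the text. Names are illustrative.

```python
def time_to_level(series, level=0.999, lo=0.0, hi=0.08, iters=60):
    """Bisection for x = lambda*t at which series(x) falls to `level`.

    Assumes series(lo) > level > series(hi) with a single crossing between.
    """
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if series(mid) > level:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def tmr(x):
    """TMR with repair, mu = 10*lambda (Table 4.7)."""
    return 1 - 3 * x**2 + 15 * x**3

def five_mr(x):
    """5MR with repair, mu = 10*lambda (Table 4.7)."""
    return 1 - 10 * x**3 + 80 * x**4

mttf_hours = 100_000                       # assumed element MTTF, 1/lambda
x_tmr, x_5mr = time_to_level(tmr), time_to_level(five_mr)
hours_tmr, hours_5mr = x_tmr * mttf_hours, x_5mr * mttf_hours

print(f"TMR: x = {x_tmr:.4f}, about {hours_tmr:.0f} hours")
print(f"5MR: x = {x_5mr:.4f}, about {hours_5mr:.0f} hours")
```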

Higher-Level Voting. One could extend the above analysis to cover higher-level
voting systems; for example, 7-level and 9-level voting. Even though it
is easy to replicate many different copies of a logic circuit on a chip at low
cost, one seldom goes beyond the 3-level or 5-level voting system, although
the foregoing methods could be used to solve for the reliability of such higher-level
systems.

If one fabricates a very large scale integrated circuit (VLSI) with many circuits
and a voter, an interesting question arises. There is a yield problem with
complex chips caused by imperfections. With so much redundancy, how can
one be sure that the chip does not contain such imperfections that a 5-level
voter system with imperfections is really equivalent to a 4- or 3-level voter
system? In fact, a 5-level voter system with two failed circuits is actually inferior
to a 3-level voter. One more failure in the former will result in three failed
and two good circuits, and the voter believes the failed three. In the case of a
3-level voter, a single failure will still leave the remaining two good circuits
in control. The solution is to provide internal test inputs on an IC voter system
so that the components of the system can be tested. This means that extra pins
on the chip must be dedicated to test points. The extra outputs in Fig. 4.10
could provide these test points, as was discussed in Section 4.6.2.

The next section discusses the effect of voter reliability on N-modular redundancy.
Note that we have not discussed the effects of coverage in a TMR system.
In general, the simple nature of a voter catches almost all failures, and
coverage is not significant in modeling the system.


4.8 N-MODULAR REDUNDANCY WITH REPAIR AND
IMPERFECT VOTERS

4.8.1 Introduction 

The analysis of the preceding section did not include two imperfections in a
voting system: the reliability of the voter itself and the concept of coverage.
In the case of parallel and standby systems, which were treated in Chapter
3, coverage made a considerable difference in the reliability. The circuit that
detects failures of the active system and switches to the standby (hot or cold)
element in a parallel or standby system is reasonably complex and will have
a significant failure rate. Furthermore, it will have the problem that it cannot
detect all faults and will sometimes fail to switch when it should or switch
when it should not. In the case of a voter, the concept and the resulting circuit
are much simpler. Thus one might be justified in assuming that the voter does
not have a coverage problem and so reduce the evaluation to the reliability of
a voter and how it affects the system reliability. This can then be contrasted
with the reliability of a coupler and a parallel system (introduced in Section
3.5).

4.8.2 Voter Reliability 

We begin our discussion of voter reliability by considering the reliability of
a TMR system as shown in Fig. 4.1 and the reliability expression given in
Eq. (4.19). In Section 4.5, we asked how small the voter reliability, $p_v$, can
be so that the gains of TMR still exceed the reliability of a single circuit.
The analysis was given in Eqs. (3.34) and (3.35). Now, we perform a similar
analysis for a TMR system with an imperfect voter. The computation proceeds
from a consideration of Eq. (4.19). If the voter were perfect, $p_v = 1$, then the
reliability would be computed as

$$R_{\mathrm{TMR}} = 3p_c^2 - 2p_c^3 \tag{4.43a}$$

If we include an imperfect voter, this expression becomes

$$R_{\mathrm{TMR}} = 3p_v p_c^2 - 2p_v p_c^3 = p_v(3p_c^2 - 2p_c^3) \tag{4.43b}$$

If we assume constant-failure rates for the voter and the circuits in the TMR
configuration, then for the voter we have $p_v = e^{-\lambda_v t}$, and for the TMR circuits,
$p_c = e^{-\lambda t}$. If we use a truncated power-series approximation for each exponential
and substitute into Eq. (4.43b), one obtains an expression for the initial reliability, as
follows:

$$R_{\mathrm{TMR}} \approx \left[1 - \lambda_v t + \frac{(\lambda_v t)^2}{2!} - \frac{(\lambda_v t)^3}{3!}\right] \times \left\{ 3\left[1 - 2\lambda t + \frac{(2\lambda t)^2}{2!} - \frac{(2\lambda t)^3}{3!}\right] - 2\left[1 - 3\lambda t + \frac{(3\lambda t)^2}{2!} - \frac{(3\lambda t)^3}{3!}\right] \right\} \tag{4.44a}$$

Expanding the preceding equation and retaining only the first four terms yields

$$R_{\mathrm{TMR}} \approx 1 - \lambda_v t + \frac{(\lambda_v t)^2}{2} - 3(\lambda t)^2 \tag{4.44b}$$

Furthermore, we are mainly interested in the cases where $\lambda_v < \lambda$; thus we can
omit the third term (which is a second-order term in $\lambda_v$) and obtain

$$R_{\mathrm{TMR}} \approx 1 - \lambda_v t - 3(\lambda t)^2 \tag{4.44c}$$

If we want the effect of the voter to be negligible, we let $\lambda_v t < 3(\lambda t)^2$, that is,

$$\frac{\lambda_v}{\lambda} < 3\lambda t \tag{4.45}$$


One can compare this result with that given in Eq. (3.35) for two parallel systems
by setting n = 2, yielding

$$\frac{\lambda_c}{\lambda} < \lambda t \tag{3.35}$$

where $\lambda_c$ denotes the coupler failure rate. The approximate result is that the
coupler must have a failure rate three times smaller than that of the voter for
the same decrease in reliability.

One can examine the effect of repair on the above results by examining 
Eq. (4.27d) and Eq. (4.42). In both cases, the effect of the repair rate does 
not appear until the cubic term is encountered. The above comparisons only 
involved the linear and quadratic terms, so the effect of repair would only 
become apparent if the repair rate were very large and the time interval of 
interest were extended. 
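
To see how well the two-term approximation tracks the exact expression, one can compare Eq. (4.43b), with exponential element and voter reliabilities, against Eq. (4.44c). The following Python sketch does this for illustrative values of $\lambda t$ and $\lambda_v t$ chosen here as assumptions; the function names are hypothetical.

```python
import math

def r_tmr_voter_exact(lam_t, lam_v_t):
    """Eq. (4.43b) with p_c = exp(-lam*t) and p_v = exp(-lam_v*t)."""
    p_c = math.exp(-lam_t)
    p_v = math.exp(-lam_v_t)
    return p_v * (3 * p_c**2 - 2 * p_c**3)

def r_tmr_voter_approx(lam_t, lam_v_t):
    """Eq. (4.44c): first-order voter term plus second-order circuit term."""
    return 1 - lam_v_t - 3 * lam_t**2

def voter_negligible(lam, lam_v, t):
    """Condition of Eq. (4.45): lam_v / lam < 3 * lam * t."""
    return lam_v / lam < 3 * lam * t

exact = r_tmr_voter_exact(0.05, 0.005)     # lam*t = 0.05, lam_v*t = 0.005
approx = r_tmr_voter_approx(0.05, 0.005)
print(f"exact = {exact:.6f}, approx = {approx:.6f}")
print(voter_negligible(lam=1.0, lam_v=0.1, t=0.05))
```

The gap between the two values is dominated by the neglected cubic term, so the approximation should only be used for small $\lambda t$.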

4.8.3 Comparison of TMR, Parallel, and Standby Systems 

Another advantage of voter reliability over parallel and standby reliability is 
that there is a straightforward scheme for implementing voter redundancy (e.g., 
Fig. 4.8). Of course, one can also make redundant couplers for parallel or 
standby systems, but they may be more complex than redundant voters. 

It is easy to make a simple model for Fig. 4.8. Assume that the voters fail
so that their outputs are stuck-at-zero or stuck-at-one and that voter failures
do not corrupt the outputs of the circuits that feed the voters (e.g., $A_1$, $B_1$, and
$C_1$). Assume just a single stage ($A_1$, $B_1$, and $C_1$) and a single redundant voter
system ($V_1$, $V_1'$, and $V_1''$). The voter system works if two or three of the three voters
work. Thus this is the same formula as for TMR systems, and the reliability of
the system becomes

$$R_{\mathrm{TMR}} \times R_{\mathrm{voter}} = (3p_c^2 - 2p_c^3) \times (3p_v^2 - 2p_v^3) \tag{4.46}$$

It is easy to evaluate the advantages of redundant voters. Assume that $p_c$ =
0.9 and that the voter is 10 times as reliable: $(1 - p_c) = 0.1$, $(1 - p_v) = 0.01$,
and $p_v = 0.99$. With a single voter, $R = 0.99[3(0.9)^2 - 2(0.9)^3] = 0.99 \times 0.972$
= 0.962. In the case of a redundant voter, we have $[3(0.99)^2 - 2(0.99)^3] \times
[3(0.9)^2 - 2(0.9)^3] = 0.999702 \times 0.972 = 0.9717$. The redundant voter is thus
significant; if the voter is less reliable, voter redundancy is even more effective.
Assume that $p_v = 0.95$; for a single voter, $R = 0.95[3(0.9)^2 - 2(0.9)^3] = 0.95 \times
0.972 = 0.923$. In the case of a redundant voter, we have $[3(0.95)^2 - 2(0.95)^3]
\times [3(0.9)^2 - 2(0.9)^3] = 0.99275 \times 0.972 = 0.964953$.
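
The arithmetic above is easy to reproduce. A minimal Python sketch of Eqs. (4.43b) and (4.46), with illustrative function names:

```python
def majority_of_three(p):
    """Probability that at least two of three independent units work."""
    return 3 * p**2 - 2 * p**3

def tmr_single_voter(p_c, p_v):
    """Eq. (4.43b): TMR circuits followed by one voter."""
    return p_v * majority_of_three(p_c)

def tmr_redundant_voter(p_c, p_v):
    """Eq. (4.46): TMR circuits followed by a triplicated voter."""
    return majority_of_three(p_v) * majority_of_three(p_c)

for p_v in (0.99, 0.95):
    single = tmr_single_voter(0.9, p_v)
    redundant = tmr_redundant_voter(0.9, p_v)
    print(f"p_v = {p_v}: single voter {single:.6f}, redundant voter {redundant:.6f}")
```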

The foregoing calculations and discussions were performed for a TMR circuit
with a single voter or redundant voters. It is possible to extend these computations
to the subsystem level for a system such as that depicted in Fig. 4.8.
In addition, one can repair a failed component of a redundant voter; thus one
can use the analysis techniques previously derived for TMR and 5MR systems
where the systems and voters can both be repaired. However, repair of voters
really begs a larger question: How will we modularize the system architecture?

Assume one is going to design the system architecture with redundant voters
and voting at a subsystem level. If the voters are to be placed on a single chip 
along with the circuits, then there is no separate repair of a voter system—only 
repair of the circuit and voter subsystem. The alternative is to make a separate 
chip for the N circuits and a separate chip for the redundant voter. The proper 
strategy to choose depends on whether there will be scheduled downtime for 
the system during which testing and replacement can occur and also whether 
the chips have sufficient test points. No general conclusion can be reached; the 
system architecture should be critiqued with these issues in mind. 


4.9 AVAILABILITY OF N-MODULAR REDUNDANCY WITH
REPAIR AND IMPERFECT VOTERS 

4.9.1 Introduction 

When repair is present in a system, it is often possible for the system to fail
and be down for a short period of time without serious operational effects.
Suppose a computer used for electronic funds transfers is down for a short
period of time. This is not catastrophic if the system is designed so that it can
tolerate brief outages and perform the funds transfers at a later time period. If
the system is designed to be self-diagnostic, and if a technician and replacement
plug-in boards are both available, the machine can be restored quickly
to operational status. For such systems, availability is a useful measure of system
performance, as with reliability, and is the probability that the system is
up at any point in time. It can be measured during operation by recording the
downtimes and operating times for several failure and repair cycles. The availability
is given by the ratio of the sum of the uptimes for the system divided
by the sum of the uptimes and the downtimes. (Formally, this ratio becomes
the availability in the limit as the system operating time approaches infinity.)
The availability A(t) is the probability that the system is up at time t, which
can be written as a sum of probabilities:

$$A(t) = P(\text{no failures}) + P(\text{one failure} + \text{one repair}) + P(\text{two failures} + \text{two repairs}) + \cdots + P(n\ \text{failures} + n\ \text{repairs}) + \cdots \tag{4.47}$$


Availability is always higher than reliability, since the first term in Eq. (4.47) 
is the reliability and all the other terms are positive numbers. Note that only 
the first few terms in Eq. (4.47) are significant for a moderate time interval 
and higher-order terms become negligible. Thus one could evaluate availability 
analytically by computing the terms in Eq. (4.47); however, the use of the 
Markov model simplifies such a computation.

[Figure 4.13 A Markov availability model for a TMR system with repair. States:
zero failures; one failure; two or three failures. Self-loop terms $1 - 3\lambda\Delta t$ and
$1 - (2\lambda + \mu)\Delta t$; repair branches $\mu\Delta t$.]

4.9.2 Markov Availability Models 

A brief introduction to availability models appeared in Section 3.8.5; such computations
will continue to be used in this section, and availabilities for TMR
systems, parallel systems, and standby systems will be computed and compared.
As in the previous section, we will make use of the fact that the Markov
availability model given in Fig. 3.16 will hold with minor modifications (see
Fig. 4.13). In Fig. 3.16, the value of $\lambda'$ is either one or two times $\lambda$, but in the
case of TMR, it is three times $\lambda$. For the second transition, between $s_1$ and
$s_2$, for the TMR system, there are two possibilities of failure; thus the transition
rate is $2\lambda$. Since there is only one repairman, the repair rate is $\mu$.

A set of Markov equations can be written that will hold for two in parallel
and two in standby, as well as for TMR. The algorithm used in the preceding
chapter will be employed. The terms 1 and $\Delta t$ are deleted from Fig. 4.13. The
time derivative of the probability of being in state $s_0$ is set equal to the “flows”
from the other nodes; for example, $-\lambda' P_{s_0}(t)$ is from the self-loop and $\mu' P_{s_1}(t)$
is from the repair branch. Applying the algorithm to the other nodes and using
algebraic manipulation yields the following:

$$\dot{P}_{s_0}(t) + \lambda' P_{s_0}(t) = \mu' P_{s_1}(t) \tag{4.48a}$$

$$\dot{P}_{s_1}(t) + (\lambda + \mu') P_{s_1}(t) = \lambda' P_{s_0}(t) + \mu'' P_{s_2}(t) \tag{4.48b}$$

$$\dot{P}_{s_2}(t) + \mu'' P_{s_2}(t) = \lambda P_{s_1}(t) \tag{4.48c}$$

$$P_{s_0}(0) = 1, \quad P_{s_1}(0) = P_{s_2}(0) = 0 \tag{4.48d}$$

The appropriate values of parameters for this set of equations are given in Table
4.8. A complete solution of these equations is given in Shooman [1990, pp.
344-347]. We will use the Laplace transform theorems previously introduced
to simplify the solution.

TABLE 4.8 Parameters of Eqs. (4.48a-d) for Various Systems

System            $\lambda$     $\lambda'$    $\mu'$    $\mu''$
Two in parallel   $\lambda$     $2\lambda$    $\mu$     $\mu$
Two standby       $\lambda$     $\lambda$     $\mu$     $\mu$
TMR               $2\lambda$    $3\lambda$    $\mu$     $\mu$

The Laplace transforms of Eqs. (4.48a-d) become

$$(s + \lambda') P_{s_0}(s) - \mu' P_{s_1}(s) = 1 \tag{4.49a}$$

$$-\lambda' P_{s_0}(s) + (s + \lambda + \mu') P_{s_1}(s) - \mu'' P_{s_2}(s) = 0 \tag{4.49b}$$

$$-\lambda P_{s_1}(s) + (s + \mu'') P_{s_2}(s) = 0 \tag{4.49c}$$


In the case of a system composed of two in parallel, two in standby, or
TMR, the system is up if it is in state $s_0$ or state $s_1$. The availability is thus
the sum of the probabilities of being in one of these two states. If one uses
Cramer’s rule or a similar technique to solve Eqs. (4.49a-c), one obtains a
ratio of polynomials in s for the availability:

$$A(s) = P_{s_0}(s) + P_{s_1}(s) = \frac{s^2 + (\lambda + \lambda' + \mu' + \mu'')s + (\lambda'\mu'' + \mu'\mu'')}{s[s^2 + (\lambda + \lambda' + \mu' + \mu'')s + (\lambda\lambda' + \lambda'\mu'' + \mu'\mu'')]} \tag{4.50}$$

Before we begin applying the various Laplace transform theorems to this
availability function, we should discuss the nature of availability and what sort
of analysis is needed. In general, availability always starts at 1 because the system
is always assumed to be up at t = 0. Examination of Eq. (4.47) shows that
initially near t = 0, the availability is just the reliability function, which of course
starts at 1. Gradually, the next term P(one failure and one repair) becomes
significant in the availability equation; as time progresses, other terms in the
series contribute. Although the overall effect based on the summation of these
many terms is hard to understand, we note that they generally lead to a slow
decay of the availability function to some steady-state value that is reasonably
close to 1. Thus the initial behavior of the availability function is not as important
as that of the reliability function. In addition, the MTTF is not always a
significant measure of system behavior. The one measure of interest is the final
value of the availability function. If the availability function for a particular
system has an initial value of unity at t = 0 and decays slowly to a steady-state
value close to unity, this system must always have a high value of availability,
in which case the final value is a lower bound on the availability. Examining
Table B7 in Appendix B, Section B8.1, we see that the final value and initial
value theorems both depend on the limit of sF(s) [in our case, sA(s)] as s
approaches 0 and $\infty$, respectively. The initial value is obtained when s approaches $\infty$.

TABLE 4.9 Comparison of the Steady-State Availability, Eq. (4.50) for Various
Systems

System            Eq. (4.50)                                                     $\mu = \lambda$   $\mu = 10\lambda$   $\mu = 100\lambda$
Two in parallel   $\mu(2\lambda + \mu)/(2\lambda^2 + 2\lambda\mu + \mu^2)$       0.6               0.984               0.9998
Two standby       $\mu(\lambda + \mu)/(\lambda^2 + \lambda\mu + \mu^2)$          0.667             0.991               0.9999
TMR               $\mu(3\lambda + \mu)/(6\lambda^2 + 3\lambda\mu + \mu^2)$       0.4               0.956               0.9994

Examination of Eq. (4.50) shows that multiplication of A(s) by s results in a cancellation of
the multiplying s term in the denominator. As s approaches infinity, both the
numerator and denominator polynomials approach $s^2$; thus the ratio approaches
1, as it should. However, to find the final value, we let s approach zero and
obtain the ratio of the two constant terms given in Eq. (4.51).

$$A(\text{steady state}) = \frac{\lambda'\mu'' + \mu'\mu''}{\lambda\lambda' + \lambda'\mu'' + \mu'\mu''} \tag{4.51}$$


The values of the parameters given in Table 4.8 are substituted in this equation, 
and the steady-state availabilities are compared for the three systems noted in 
Table 4.9. 
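
The substitution of the Table 4.8 parameters into Eq. (4.51) can be sketched in Python as follows. The parameter dictionary encodes the multipliers of $\lambda$ from Table 4.8 (with $\mu' = \mu'' = \mu$); the names are illustrative.

```python
def steady_state_availability(lam, lam_p, mu_p, mu_pp):
    """Eq. (4.51): ratio of the constant terms of s*A(s)."""
    num = lam_p * mu_pp + mu_p * mu_pp
    return num / (lam * lam_p + num)

# Multipliers of lambda for (lambda, lambda') taken from Table 4.8.
PARAMS = {
    "Two in parallel": (1, 2),
    "Two standby": (1, 1),
    "TMR": (2, 3),
}

def availability(system, ratio, lam=1.0):
    """Steady-state availability for a given mu/lambda ratio."""
    k, k_p = PARAMS[system]
    mu = ratio * lam
    return steady_state_availability(k * lam, k_p * lam, mu, mu)

for system in PARAMS:
    row = [availability(system, r) for r in (1, 10, 100)]
    print(f"{system:16s}", "  ".join(f"{a:.4f}" for a in row))
```

Because only the ratio $\mu/\lambda$ matters, the default `lam=1.0` loses no generality.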

Clearly, the Laplace transform has been of great help in solving for steady-state
availability and is superior to the simplified time-domain method: (a) let
all time derivatives equal 0; (b) delete one of the resulting algebraic equations;
(c) add the equation stating that the sum of all probabilities equals 1; and
(d) solve (see Section B7.5).
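
For comparison, steps (a) through (d) can be carried out directly on Eqs. (4.48a-c). In the sketch below the middle equation is the one deleted in step (b), which is a choice made here for convenience; exact fractions are used so that the result can be compared with Eq. (4.51). The function name is hypothetical.

```python
from fractions import Fraction

def steady_state_probs(lam, lam_p, mu_p, mu_pp):
    """Steady-state (P_s0, P_s1, P_s2) from Eqs. (4.48a-c)."""
    # (a) Derivatives set to zero: lam' * P0 = mu' * P1 and mu'' * P2 = lam * P1.
    # (b) The middle equation is redundant and is dropped.
    p1_over_p0 = Fraction(lam_p) / mu_p
    p2_over_p0 = p1_over_p0 * Fraction(lam) / mu_pp
    # (c) Normalization: P0 + P1 + P2 = 1.
    p0 = 1 / (1 + p1_over_p0 + p2_over_p0)
    # (d) Solve for the remaining probabilities.
    return p0, p0 * p1_over_p0, p0 * p2_over_p0

# TMR row of Table 4.8 with lambda = 1 and mu = 10.
p0, p1, p2 = steady_state_probs(lam=2, lam_p=3, mu_p=10, mu_pp=10)
availability_tmr = p0 + p1
print(p0, p1, p2, float(availability_tmr))
```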

Table 4.9 shows that the steady-state availability of two elements in standby
exceeds that of two parallel items by a small amount, and they both exceed
the TMR system by a greater margin. In most systems, the repair rate is much
higher than the failure rate, so the results of the last column in the table are probably
the most realistic. Note that these steady-state availabilities depend only on the
ratio $\mu/\lambda$. Before one concludes that the small advantages of one system over
another in the table are significant, the following factors should be investigated:


• It is assumed that a standby element cannot fail when it is in standby. 
This is not always true, since batteries discharge in standby, corrosion 
can occur, insulation can break down, etc., all of which may significantly 
change the comparison. 

• The reliability of the coupling device in a standby or parallel system is 
more complex than the voter reliability in a TMR circuit. These effects 
on availability may be significant. 

• Repair in any of these systems is predicated on knowing when a system
has failed. In the case of TMR, we gave a simple logic circuit that would
detect which element has failed. The equivalent detection circuit in the 
case of a parallel or standby system is more complex and may have poorer 
coverage. 

Some of these effects are treated in the problems at the end of this chapter. 
It is likely, however, that the detailed design of comparative systems must be 
modeled to make a comprehensive comparison. 

A simple numerical example will show the power of increasing system
availability using parallel and standby system configurations. In Section 3.10.1,
typical failure and repair information for a circa-1985 transaction-processing
system was quoted. The time between failures of once every two weeks translates
into a failure rate $\lambda = 1/(2 \times 168) = 2.98 \times 10^{-3}$ failures/hour, and the
time to repair of one hour becomes a repair rate $\mu = 1$ repair/hour. These values
were shown to yield a steady-state availability of 0.997, a poor value for
what should be a highly reliable system. If we assume that the computer system
architecture will be configured as a parallel system or a standby system, we
can use the formulas of Table 4.9 to compute the expected increase in availability.
For an ordinary parallel system, the steady-state availability would be
0.999982; for a standby system, it would be 0.9999911. These translate into
unavailability values $\bar{A} = 1 - A$ of $1.8 \times 10^{-5}$ and $8.9 \times 10^{-6}$, respectively. The
unavailability of the single system would of course be $3 \times 10^{-3}$. The steady-state
availability of the Stratus system was discussed in Section 3.10.2 and, based
on claimed downtime, was computed as 0.9999905, which is equivalent to an
unavailability of $95 \times 10^{-7}$. In Section 3.10.1, the Tandem unavailability, based
on hypothetical goals, was $4 \times 10^{-6}$. Comparison of these unavailability
values yields the following: (a) for a single system, $3{,}000 \times 10^{-6}$; (b) for a
parallel system, $18 \times 10^{-6}$; (c) for a standby system, $8.9 \times 10^{-6}$; (d) for a Stratus
system, $9.5 \times 10^{-6}$; and (e) for a Tandem system, $4 \times 10^{-6}$. Also compare
the Bell Labs’ ESS switching system unavailability goal and demonstrated
unavailability of $5.7 \times 10^{-6}$ and $3.8 \times 10^{-6}$, respectively. (See Table 1.4.) Of course, more
definitive data or complete models are needed for detailed comparisons.
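
The unavailability figures for the single, parallel, and standby configurations can be reproduced as follows. The sketch assumes that the single-system availability is $\mu/(\lambda + \mu)$, the usual result for one repairable element, and uses the complements of the Table 4.9 formulas for the other two; function names are illustrative.

```python
def unavailability_single(lam, mu):
    """Assumed single repairable element: 1 - mu/(lam + mu)."""
    return lam / (lam + mu)

def unavailability_parallel(lam, mu):
    """Complement of the two-in-parallel formula in Table 4.9."""
    return 2 * lam**2 / (2 * lam**2 + 2 * lam * mu + mu**2)

def unavailability_standby(lam, mu):
    """Complement of the two-standby formula in Table 4.9."""
    return lam**2 / (lam**2 + lam * mu + mu**2)

lam = 1 / (2 * 168)   # one failure every two weeks, in failures per hour
mu = 1.0              # one-hour repair time, in repairs per hour

u_single = unavailability_single(lam, mu)
u_parallel = unavailability_parallel(lam, mu)
u_standby = unavailability_standby(lam, mu)
print(f"single   {u_single:.2e}")
print(f"parallel {u_parallel:.2e}")
print(f"standby  {u_standby:.2e}")
```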

4.9.3 Decoupled Availability Models 

A simplified technique can be used to compute the steady-state value of avail¬ 
ability for parallel and TMR systems. Availability computations really involve 
the evaluation of certain conditional probabilities. Since conditional probabil¬ 
ities are difficult to deal with, we introduced the Markov model computation 
technique. There is a case in which the dependent probabilities become inde¬ 
pendent and the computations simplify. We will introduce this case by focusing 
on the availability of two parallel elements. 

Assume that we wish to compute the steady-state availability of two parallel
elements, A and B. The reliability is the probability of no system failures in
interval 0 to t, which is the probability that either A or B is good,
$P(A_g + B_g) = P(A_g) + P(B_g) - P(A_g B_g)$. The subscript "g" means that the
element is good, that is, has not failed. Similarly, the availability is the
probability that the system is up at time t, which is the probability that
either A or B is up,
$P(A_{up} + B_{up}) = P(A_{up}) + P(B_{up}) - P(A_{up} B_{up})$. The subscript
"up" means that the element is up, that is, is working at time t. The product
terms in each of the above expressions,
$P(A_g B_g) = P(A_g)P(B_g \mid A_g)$ and
$P(A_{up} B_{up}) = P(A_{up})P(B_{up} \mid A_{up})$, involve the conditional
probabilities discussed previously. If there are two repairmen, one assigned to
component A and one assigned to component B, the events $(B_g \mid A_g)$ and
$(B_{up} \mid A_{up})$ become decoupled, that is, the events are independent.
The coupling (dependence) comes
from the repairmen. If there is only one repairman and element A is down 
and being repaired, then if element B fails, it will take longer to restore B to 
operation; the repairman must first finish fixing A before working on B. In the 
case of individual repairmen, there is no wait for repair of the second element 
if two items have failed because each has its own assigned repairman. In the 
case of such decoupling, the dependent probabilities become independent and 
$P(B_g \mid A_g) = P(B_g)$ and $P(B_{up} \mid A_{up}) = P(B_{up})$. This
represents considerable simplification; it means that one can compute
$P(B_g)$, $P(A_g)$, $P(B_{up})$, and $P(A_{up})$
separately and substitute into the reliability or availability equation to achieve 
a simple solution. Before we apply this technique and illustrate the simplicity 
of the solution, we should comment that because of the high cost, it is unlikely 
that there will be two separate repairmen. However, if the repair rate is much 
larger than the failure rate, $\mu \gg \lambda$, the decoupled case is approached. This is
true since repairs are relatively fast and there is only a small probability that 
a failed element A will still be under repair when element B fails. For a more 
complete discussion of this decoupled approximation, consult Shooman [1990, 
pp. 521-529]. 

To illustrate the use of this approximation, we calculate the steady-state 
availability of two parallel elements. In the steady state, 


$$A(\text{steady state}) = P(A_{ss}) + P(B_{ss}) - P(A_{ss})P(B_{ss}) \tag{4.52}$$

The steady-state availability for a single element is given by

$$A_{ss} = \frac{\mu}{\lambda + \mu} \tag{4.53}$$


One can verify this formula by reading the derivation in Appendix B, Sections
B7.3 and B7.4, or by examining Fig. 3.16. We can reduce Fig. 3.16 to a single
element model by setting $\lambda = 0$ to remove state $s_2$ and letting
$\lambda' = \lambda$ and $\mu' = \mu$. Solving Eqs. (3.71a, b) for $P_{s_0}(t)$
and applying the final value theorem (multiply by $s$ and let $s$ approach 0)
also yields Eq. (4.53). If A and B have identical failure and repair rates,
substitution of Eq. (4.53) into Eq. (4.52) for both $A_{ss}$ and $B_{ss}$ yields



$$A(\text{steady state}) = \frac{2\mu}{\lambda + \mu} - \left(\frac{\mu}{\lambda + \mu}\right)^2 = \frac{\mu(2\lambda + \mu)}{(\lambda + \mu)^2} \tag{4.54}$$


If we compare this result with the exact one in Table 4.9, we see that the
numerator is the same and the denominator differs only by a coefficient of two
in the $\lambda^2$ term. Furthermore, since we are assuming that
$\mu \gg \lambda$, the difference is very small.
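
As a worked check of this comparison, the two expressions can be placed side by side. The exact form written here is an assumption: it is the form implied by the statement above (same numerator, a coefficient of two on $\lambda^2$ in the denominator), not a transcription of Table 4.9.

```latex
\begin{align*}
A_{\text{approx}} &= \frac{\mu(2\lambda+\mu)}{\lambda^2 + 2\lambda\mu + \mu^2},
& 1 - A_{\text{approx}} &= \frac{\lambda^2}{\lambda^2 + 2\lambda\mu + \mu^2}
  \approx \left(\frac{\lambda}{\mu}\right)^2, \\
A_{\text{exact}} &= \frac{\mu(2\lambda+\mu)}{2\lambda^2 + 2\lambda\mu + \mu^2},
& 1 - A_{\text{exact}} &= \frac{2\lambda^2}{2\lambda^2 + 2\lambda\mu + \mu^2}
  \approx 2\left(\frac{\lambda}{\mu}\right)^2 .
\end{align*}
```

Under this assumption the two availabilities differ by roughly $(\lambda/\mu)^2$, which is small when $\mu \gg \lambda$, although the approximate unavailability is about half the exact one.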

We can repeat this simplification technique for a TMR system. The TMR 
reliability equation is given by Eq. (4.2), and modification for computing the 
availability yields 


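$$A(\text{steady state}) = [P(A_{ss})]^2 [3 - 2P(A_{ss})] \tag{4.55}$$

Substitution of Eq. (4.53) into Eq. (4.55) gives

$$A(\text{steady state}) = \left(\frac{\mu}{\lambda + \mu}\right)^2 \left[3 - \frac{2\mu}{\lambda + \mu}\right] = \frac{\mu^2(3\lambda + \mu)}{(\lambda + \mu)^3} \tag{4.56}$$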


There is no obvious comparison between Eq. (4.56) and the exact TMR avail¬ 
ability expression in Table 4.9. However, numerical comparison will show that 
the formulas yield nearly equivalent results. 

The development of approximate expressions for a standby system requires 
some preliminary work. The Poisson distribution (Appendix A, Section A5.4) 
describes the probabilities of success and failure in a standby system. The sys¬ 
tem succeeds if there are no failures or one failure; thus the reliability expres¬ 
sion is computed from the Poisson distribution as 


$$R(\text{standby}) = P(0 \text{ failures}) + P(1 \text{ failure}) = e^{-\lambda t} + \lambda t\, e^{-\lambda t} \tag{4.57}$$

If we wish to transform this equation in terms of the probability of success p
of a single element, we obtain $p = e^{-\lambda t}$ and $\lambda t = -\ln p$.
(See also Shooman [1990, p. 147].) Substitution into Eq. (4.57) yields

$$R(\text{standby}) = p(1 - \ln p) \tag{4.58}$$

Finally, substitution in Eq. (4.58) of the steady-state availability from Eq. (4.53) 
yields an approximate expression for the availability of a standby system as 
follows: 


$$A(\text{steady state}) = \frac{\mu}{\lambda + \mu} \left[1 - \ln\left(\frac{\mu}{\lambda + \mu}\right)\right] \tag{4.59}$$


Comparing Eq. (4.59) with the exact expression in Table 4.9 is difficult
because of the different forms of the equations. The exact and approximate
expressions are compared numerically in Table 4.10. Clearly, the approximations
are close to the exact values. The best way to compare availability numbers,
since they are all so close to unity, is to express the difference relative to
the exact unavailability, $1 - A$. Thus, in Table 4.10, the difference in the
results for the parallel system is (0.99990197 - 0.99980396)/(1 - 0.99980396) =
0.49995, or about 50%. Similarly, for the standby system, the difference in
the results is (0.999950823 - 0.999901)/(1 - 0.999901) = 0.50326, which is
also about 50%. For the TMR system, the difference in the results is
(0.999707852 - 0.999417815)/(1 - 0.999417815) = 0.49819, again about 50%. The
reader will note that every approximation yields a slightly higher availability
than the exact value (the approximate unavailability is about half the exact
one) and that all are satisfactory for preliminary calculations. It is
recommended that an exact computation be made once a design is chosen; however,
these approximations are always useful in checking more exact results obtained
from analysis or a computer program.
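
The decoupled approximations are simple enough to evaluate in a few lines. The following Python sketch evaluates Eqs. (4.52), (4.55), and (4.59) from the single-element availability of Eq. (4.53) and forms the relative difference in unavailability used above. The ratio $\lambda/\mu = 0.01$ is an assumption inferred from the quoted numbers, not a value stated in this section, and the function names are illustrative only. The exact availabilities are copied from the text rather than recomputed.

```python
import math

def single_availability(lam, mu):
    """Steady-state availability of one repairable element, Eq. (4.53)."""
    return mu / (lam + mu)

def parallel_approx(a):
    """Two parallel elements, decoupled approximation, Eq. (4.52)."""
    return 2 * a - a * a

def tmr_approx(a):
    """TMR, decoupled approximation, Eq. (4.55)."""
    return a * a * (3 - 2 * a)

def standby_approx(a):
    """Two-element standby, decoupled approximation, Eq. (4.59)."""
    return a * (1 - math.log(a))

def unavailability_gap(approx_value, exact_value):
    """Difference between the results relative to the exact unavailability."""
    return (approx_value - exact_value) / (1 - exact_value)

# Assumption: lambda/mu = 0.01 (inferred; it reproduces the quoted values).
a = single_availability(lam=0.01, mu=1.0)

# Exact availabilities as quoted in the text from Table 4.10.
exact = {"parallel": 0.99980396, "standby": 0.999901, "TMR": 0.999417815}
approx = {
    "parallel": parallel_approx(a),
    "standby": standby_approx(a),
    "TMR": tmr_approx(a),
}

for name in exact:
    gap = unavailability_gap(approx[name], exact[name])
    print(f"{name:8s} approx={approx[name]:.9f} exact={exact[name]:.9f} gap={gap:.4f}")
```

Working with the gap in unavailability rather than the raw availabilities avoids comparing numbers that agree in their first three or four digits.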

The foregoing approximations are frequently used in industry. However, it 
is important to check their accuracy. The first reference known to the author 
of such approximations appears in Calabro [1962, pp. 136-139]. 

4.10 MICROCODE-LEVEL REDUNDANCY 

One can employ redundancy at the microcode level in a computer. 
Microcode consists of the elementary instructions that control the CPU or 
microprocessor—the heart of modern computers. Microinstructions perform 
such elementary operations as the addition of two numbers, the complement of 
a number, and shift left or right operations. When one structures the microcode 
of the computing chip, more than one algorithm can often be used to realize 
a particular operation. If several equivalent algorithms can be written, each 
one can serve the same purpose as the independent circuits in the N-modular
redundancy. If the algorithms are processed in parallel, there is no reduction in 
computing speed except for the time to perform a voting algorithm. Of course, 
if all the algorithms use some of the same elements, and if those elements are 
faulty, the computations are not independent. One of the earliest works on 
microinstruction redundancy is Miller [1967]. 

4.11 ADVANCED VOTING TECHNIQUES 

The voting techniques described so far in this chapter have all followed a sim¬ 
ple majority voting logic. Many other techniques have been proposed, some 
of which have been implemented. This section introduces a number of these 
techniques. 

4.11.1 Voting with Lockout 

When N-modular redundancy is employed and N is greater than three, additional
considerations emerge. Let us consider a 4-level majority voter as an
example. (This is essentially the same architecture that is embedded into the
Space Shuttle's primary flight control system, discussed in Chapter 5 as an
example of software redundancy and shown in Fig. 5.19. However, if we focus
on the first four computers in the primary flight control system, we have an 
example of 4-level voting with lockout. The backup flight control system serves 
as an additional level of redundancy; it will be discussed in Chapter 5.) 

The question arises of what to do with a failed system when N is greater 
than three. To provide a more detailed discussion, we introduce the fact that 
failures can be permanent as well as transient. Suppose that hardware B in Fig.
5.19 experiences a failure and we know that it is permanent. There is no reason
to leave it in the circuit if we have a way to remove it. The reasoning is that 
if there is a second failure, there is a possibility that the two failed elements 
will agree and the two good elements will agree, creating a standoff. Clearly, 
this can be avoided if the first element is disconnected (locked out) from the 
comparison. In the Space Shuttle control system, this is done by an astronaut 
who has access to onboard computer diagnostic information and also by con¬ 
sultation with Ground Control, which has access to telemetered data on the 
control system. The switch shown at the output of each computer in Fig. 5.19 
is activated by an astronaut after appropriate deliberation and can be reversed 
at any time. NASA refers to this system as fail-safe-fail-operational, mean¬ 
ing that the system can experience two failures, can disconnect the two failed 
computers, and can have two remaining operating computers connected in a 
comparison arrangement. The flight rules that NASA uses to decide on safe 
modes of shuttle operation would rule on whether the shuttle must terminate 
a mission if only two valid computers in the primary system remain. In any 
event, there would clearly be an emergency situation in which the shuttle is 
still in orbit and one of the two remaining computers fails. If other tests could 
determine which computer gives valid information, then the system could con¬ 
tinue with a single computer. One such test would be to switch out one of the 
computers and see if the vehicle is still stable and handles properly. The
computers could then be swapped, and stability and control could be observed for
the second computer. If such a test identifies the failed computer, the system 
is still operating with one good computer. Clearly, with Ground Control and 
an astronaut dealing with an emergency, there is the possibility of switching 
back in a previously disconnected computer in the hope that the old failure 
was only a transient problem that no longer exists. Many of these cases are 
analyzed and compared in the following paragraphs. 

If we consider that the lockout works perfectly, the system will succeed if 
there are 0, 1, or 2 failures. The probability computation is simple using the 
binomial distribution. 

$$\begin{aligned}
R(2:4) &= B(4:4) + B(3:4) + B(2:4) \\
       &= [p^4] + [4p^3 - 4p^4] + [6p^2 - 12p^3 + 6p^4] \\
       &= 3p^4 - 8p^3 + 6p^2
\end{aligned} \tag{4.60}$$


TABLE 4.11 Comparison of Reliabilities for Various Voting Systems

Single Element   TMR Voting       Two-out-of-Four         One-out-of-Four
$p$              $p^2(3 - 2p)$    $p^2(3p^2 - 8p + 6)$    $p(4p^2 - p^3 - 6p + 4)$

1                1                1                       1
0.8              0.896            0.9728                  0.9984
0.6              0.648            0.8208                  0.9744
0.4              0.352            0.5248                  0.8704
0.2              0.104            0.1808                  0.5904
0                0                0                       0


The reliability will be higher if we can detect and isolate a third failure. To
compute the reliability, we start with Eq. (4.60) and add the binomial
probability $B(1:4) = 4p(1 - p)^3 = -4p^4 + 12p^3 - 12p^2 + 4p$. The result is
given in the following equation:

$$\begin{aligned}
R(1:4) &= R(2:4) + B(1:4) \\
       &= -p^4 + 4p^3 - 6p^2 + 4p
\end{aligned} \tag{4.61}$$

Note that deriving Eqs. (4.60) and (4.61) involves some algebra, and a sim¬ 
ple check on the result can help detect some common errors. We know that if 
every element in a system has failed, p = 0 and the reliability must be 0 regard¬ 
less of the system configuration. Thus, one necessary but not sufficient check 
is to substitute p = 0 in the reliability polynomial and see if the reliability is 0. 
Clearly both Eqs. (4.60) and (4.61) satisfy this requirement. Similarly, we can 
check to see that the reliability is 1 when p = 1. Again, both equations also 
satisfy this necessary check. Equations (4.60) and (4.61) are compared with 
a TMR system Eq. (4.43a) and a single element in Table 4.11 and Fig. 4.14. 
Note that the TMR voter is poorer than a single element for p < 0.5 but better 
than a single element for p > 0.5. 
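
The entries of Table 4.11 can be regenerated directly from binomial terms, which also serves as a check on the algebra in Eqs. (4.60) and (4.61). The Python sketch below assumes independent, identical elements and perfect voting and lockout, as in the text. The helper names `binomial` and `r_at_least` are illustrative, not notation from the chapter.

```python
from math import comb

def binomial(k, n, p):
    """B(k:n): probability that exactly k of n independent elements are good."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def r_at_least(m, n, p):
    """Reliability of a system that needs at least m good elements out of n."""
    return sum(binomial(k, n, p) for k in range(m, n + 1))

def table_row(p):
    """Single element, TMR, two-out-of-four, one-out-of-four reliabilities."""
    return (p, r_at_least(2, 3, p), r_at_least(2, 4, p), r_at_least(1, 4, p))

for p in (1.0, 0.8, 0.6, 0.4, 0.2, 0.0):
    single, tmr, two_of_four, one_of_four = table_row(p)
    print(f"{single:.1f}  {tmr:.4f}  {two_of_four:.4f}  {one_of_four:.4f}")
```

Summing binomial terms avoids expanding the polynomials by hand, so the same two functions cover any m-out-of-N arrangement.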

4.11.2 Adjudicator Algorithms 

A comprehensive discussion of various voting techniques appears in McAl¬ 
lister and Vouk [1996]. The authors frame the discussion of voting based on 
software redundancy, that is, the use of two or more independently developed
versions of the same software. In this book, N-version software is discussed in
Sections 5.9.2 and 5.9.3. The more advanced voting techniques will be dis¬ 
cussed in this section since most apply to both hardware and software. 

McAllister and Vouk [1996] introduce a more general term for the voter 
element: an adjudicator, the underlying logic of which is the adjudicator
algorithm. The adjudicator algorithm for majority voting (N-modular redundancy)
is simply $n + 1$ or more agreements out of $N = 2n + 1$ elements (see also
Section 4.4), where n is an integer greater than 0 (it is commonly 1 or 2). 
This algorithm is formulated for an odd number of elements. If we wish to
also include even values of N, we can describe the algorithm as an m-out-of-N
voter, with N taking on any integer value equal to or larger than 3. The
algorithm represents agreement if m or more element outputs agree, where m is
an integer satisfying $m \ge \lceil (N + 1)/2 \rceil$. The ceiling function,
$\lceil x \rceil$, is the smallest integer that is greater than or equal to x
(e.g., the roundup function).

Figure 4.14 Reliability comparison of the three voter circuits given in Table
4.11. (Horizontal axis: element success probability, p.)
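
The m-out-of-N majority rule can be sketched in a few lines of Python. This is an illustration only: it assumes the element outputs are hashable values that either match exactly or do not, and the function names are hypothetical.

```python
from collections import Counter
from math import ceil

def majority_threshold(n_elements):
    """Smallest m that constitutes a majority: m = ceil((N + 1) / 2)."""
    return ceil((n_elements + 1) / 2)

def majority_vote(outputs):
    """Return the value on which at least m outputs agree, or None if none does."""
    m = majority_threshold(len(outputs))
    value, count = Counter(outputs).most_common(1)[0]
    return value if count >= m else None

print(majority_threshold(3), majority_threshold(4), majority_threshold(5))
print(majority_vote([1, 1, 0]))      # two of three agree
print(majority_vote([1, 1, 0, 0]))   # standoff with four elements
```

For N = 4 the threshold is three, so a two-against-two split produces no output, which is the standoff discussed for the 4-level voter.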

4.11.3 Consensus Voting 

If there is a sizable number of elements that process in parallel (hardware or 
software), then a number of agreement situations arise. The majority vote may 
fail, yet there may be agreement among some of the elements. An adjudication
algorithm can be defined for the consensus case, which is more complex than
majority voting. Again, N is the number of parallel elements ($N > 1$) and $k$
is the largest number of element outputs that agree. The symbol $O_k$ denotes
the set of $k$ element outputs that agree. In some cases, there can be more than
one set of agreements, resulting in several sets $O_{k_i}$, and the adjudication
must choose between
the multiple agreements. A flow chart is given in Fig. 4.15 that is based on 
the consensus voting algorithm in McAllister and Vouk [1996, p. 578]. 
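
A minimal Python sketch of consensus adjudication follows. It is not a transcription of the flow chart in Fig. 4.15: it assumes exact-match comparison of outputs, returns the unique largest agreement set when one exists, and otherwise hands the tied candidates to a caller-supplied rule (a random choice by default). The names `consensus_vote` and `tie_break` are hypothetical.

```python
import random
from collections import Counter

def consensus_vote(outputs, tie_break=None):
    """Return (chosen_value, k), where k is the size of the largest agreement.

    If exactly one value is shared by k outputs, that value wins (this covers
    the majority case as well). If several values tie at k, including the
    case k = 1 where no two outputs agree, tie_break picks among them.
    """
    counts = Counter(outputs)
    k = max(counts.values())
    candidates = [value for value, count in counts.items() if count == k]
    if len(candidates) == 1:
        return candidates[0], k
    chooser = tie_break if tie_break is not None else random.choice
    return chooser(candidates), k

# Seven elements, no majority (four would be needed), but a unique plurality.
print(consensus_vote([3, 3, 7, 7, 7, 9, 1]))
# Two groups of size two: a deterministic tie-break is supplied for the demo.
print(consensus_vote([1, 1, 2, 2, 3], tie_break=min))
```

Passing the tie-break rule as a parameter keeps the counting logic separate from the policy, so an acceptance test can be substituted for the random choice without changing the voter.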

If $k = 1$, there are obviously ties in the consensus algorithm. A similar
situation ensues if $k > 1$ but there is more than one group with the same
value of k; in either case a tie-breaking algorithm must be used. One such algorithm is a random
choice among the ties; another is to test the elements for correct operation, which 
in terms of software version consensus is called acceptance testing of the soft¬ 
ware. Initially, such testing may seem better suited to software than to hardware; 
in reality, however, such is not the case because hardware testing has been used in 
the past. The Carousel Inertial Navigation System used on the early Boeing 747 
and other aircraft had three stable platforms, three computers, and a redundancy 
management system that performed majority voting. One means of checking the 
validity of any of the computers was to submit a stored problem for solution and 
to check the results with a stored solution. The purpose was to help diagnose com¬ 
puter malfunctions and lock a defective computer out of the system. Also during 
the time when adaptive flight control systems were in use, some designs used test 
signals mixed with the normal control signals. By comparing the input test sig¬ 
nals and the output response, one could measure the parameters of the aircraft 
(the coefficients of the governing differential equations) and dynamically adjust
the feedback signals for best control. 

4.11.4 Test and Switch Techniques 

The discussion in the previous section established the fact that hardware test¬ 
ing is possible in certain circumstances. Assuming that such testing has a high 
probability of determining success or failure of an element and that two or more 
elements are present, we can operate with element one alone as long as it tests 
valid. When a failure of element one is detected, we can switch to element two, 
etc. The logic of such a system differs little from that of the standby system 
shown in Fig. 3.12 for the case of two elements, but the detailed implementa¬ 
tion of test and switch may differ somewhat from the standby system. When 
these concepts are applied to software, the adjudication algorithm becomes an 
acceptance test. The switch to an earlier state of the process before failure was 
detected and the substitution of a second version of the software is called roll¬ 
back and recovery, but the overall philosophy is generally referred to as the 
recovery block technique. 
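
The software form of test and switch can be sketched as follows. This Python example is a simplified illustration under stated assumptions: the checkpoint is an in-memory deep copy, an exception is treated as a failed acceptance test, and the square-root versions and the acceptance test are hypothetical stand-ins for independently written software versions.

```python
import copy
import math

def recovery_block(state, versions, acceptance_test):
    """Run versions in order until one passes the acceptance test.

    Each attempt starts from a fresh copy of the checkpointed state, which
    plays the role of rollback; moving on to the next version is the recovery.
    """
    checkpoint = copy.deepcopy(state)
    for version in versions:
        trial_state = copy.deepcopy(checkpoint)  # roll back to the checkpoint
        try:
            result = version(trial_state)
        except Exception:
            continue  # a crash is treated like a failed acceptance test
        if acceptance_test(checkpoint, result):
            return result
    raise RuntimeError("no version passed the acceptance test")

# Hypothetical versions of a square-root routine; the primary has a sign bug.
def primary_sqrt(x):
    return -math.sqrt(x)

def alternate_sqrt(x):
    return x ** 0.5

def sqrt_acceptance(x, r):
    """Accept r only if it is non-negative and r * r reproduces x."""
    return r >= 0 and abs(r * r - x) < 1e-9

print(recovery_block(2.0, [primary_sqrt, alternate_sqrt], sqrt_acceptance))
```

As with a hardware standby system, the benefit depends on how reliably the acceptance test distinguishes good results from bad ones.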

4.11.5 Pairwise Comparison 

We assume that the number of elements is divisible by two, that is, $N = 2n$,
where n is an integer greater than one. The outputs of modules are compared
in pairs; if these pairs do not check, they are switched out of the circuit. The
most practical application is where $n = 2$ and $N = 4$. For discussion
purposes, we call the elements digital circuits A, B, C, and D. Circuit A is
compared with circuit B; circuit C is compared with circuit D. The output of the
AB pair is then compared with the output of the CD pair, an activity that I
refer to as pairwise comparison. The software analog I call N self-checking
programming. The reader should reflect that this is essentially the same logic
used in the Stratus system fault detection described in Section 3.11.

Figure 4.15 Flow chart based on the consensus voting algorithm in McAllister and
Vouk [1996, p. 578].

Assuming that all the comparators are perfect, the pairwise comparison
described in the preceding paragraph for $N = 4$ will succeed if (a) all four
elements succeed ($ABCD$); (b) three elements succeed
($\bar{A}BCD + A\bar{B}CD + AB\bar{C}D + ABC\bar{D}$); or (c) two elements fail
and both failures lie in the same pair
($\bar{A}\bar{B}CD + AB\bar{C}\bar{D}$), where an overbar denotes a failed
element. In the case of (a), all elements succeed and no failures are present;
in (b), on the other hand, the one failure means that one pair of elements
disconnects itself but that the remaining pair continues to operate
successfully. There are six ways for two failures to occur, but only the two
ways given in (c) leave one pair intact; one failure in each pair represents a
system failure. If each of the four elements is identical with a probability of
success of p, the probability of success can be obtained as follows from the
binomial distribution:

$$R(\text{pairwise}:4) = B(4:4) + B(3:4) + (2/6)B(2:4) \tag{4.62a}$$

Substituting terms from Eq. (4.60) into Eq. (4.62a) yields

$$R(\text{pairwise}:4) = (p^4) + (4p^3 - 4p^4) + (1/3)(6p^2 - 12p^3 + 6p^4) = p^2(2 - p^2) \tag{4.62b}$$

Equation (4.62b) is compared with other systems in Table 4.12, where we see
that pairwise comparison is slightly less reliable than TMR.
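
The pairwise result can be checked by brute force. The Python sketch below enumerates all 16 good/failed combinations of four elements and sums the probabilities of the successful ones. It assumes perfect comparators and independent elements, and it models success as "at least one pair has both elements good", which is one reading of cases (a) through (c). The function names are illustrative.

```python
from itertools import product

def pairwise_success(a, b, c, d):
    """System is up if pair (A, B) or pair (C, D) has both elements good."""
    return (a and b) or (c and d)

def pairwise_reliability(p):
    """Sum the probability of every element-state combination that succeeds."""
    total = 0.0
    for states in product((True, False), repeat=4):
        if pairwise_success(*states):
            good = sum(states)
            total += p**good * (1 - p) ** (4 - good)
    return total

def successes_by_failures():
    """Count successful combinations grouped by the number of failed elements."""
    counts = {}
    for states in product((True, False), repeat=4):
        if pairwise_success(*states):
            failed = 4 - sum(states)
            counts[failed] = counts.get(failed, 0) + 1
    return counts

print(successes_by_failures())
for p in (0.8, 0.6, 0.4, 0.2):
    print(p, round(pairwise_reliability(p), 4), round(p**2 * (2 - p**2), 4))
```

The enumeration gives one, four, and two successful combinations for zero, one, and two failures, which is where the factor 2/6 on the two-failure term comes from.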

There are various other combinations of advanced voting techniques described
by McAllister and Vouk [1996], who also compute and compare the reliability of
many of these systems by assuming independent as well as dependent failures.

TABLE 4.12 Comparison of Reliabilities for Various Voting Systems

Single Element   Pairwise Voting   Two-out-of-Four         TMR Voting
$p$              $p^2(2 - p^2)$    $p^2(3p^2 - 8p + 6)$    $p^2(3 - 2p)$

1                1                 1                       1
0.8              0.8704            0.9728                  0.896
0.6              0.590             0.8208                  0.648
0.4              0.2944            0.5248                  0.352
0.2              0.0784            0.1808                  0.104
0                0                 0                       0

4.11.6 Adaptive Voting 

Another technique for voting makes use of the fact that some circuit failure 
modes are intermittent or transient. In such a case, one does not wish to lock 
out (i.e., ignore) a circuit when it is behaving well (but when it is malfunc¬ 
tioning, it should be ignored). The technique of adaptive voting can be used to 
automatically switch between these situations [Pierce, 1961; Shooman, 1990, 
p. 324; Siewiorek, 1992, pp. 174, 178-182].

An ordinary majority voter may be visualized as a device that takes the 
average of the outputs and gives a one output if the average is > 0.5 and a 
zero output if the average is < 0.5. (In the case of software outputs, a range of 
values not limited to the range 0-1 will occur, and one can deal with various 
point estimates such as the average, the arithmetic mean of the min and max 
values, or, as McAllister and Vouk suggest, the median.) An adaptive voter may
be viewed as a weighted sum where each output $x_i$ is weighted by a coefficient
$a_i$. The coefficient $a_i$ could be adjusted to equal the probability that the
output $x_i$ is correct. Thus the test quantity of the adaptive voter (with an
odd number of elements, $2n + 1$) would be given by

$$\frac{a_1 x_1 + a_2 x_2 + \cdots + a_{2n+1} x_{2n+1}}{a_1 + a_2 + \cdots + a_{2n+1}} \tag{4.63}$$


The coefficients $a_i$ can be adjusted dynamically by taking statistics on the
agreement between each $x_i$ and the voter output over time. Another technique
is to periodically insert test inputs and compare each output $x_i$ with the
known (i.e., precomputed) correct output. If some $x_i$ is frequently in error,
it should be disconnected. The adaptive voter adjusts $a_i$ to be a very small
number, which
is in essence the same thing. The reliability of the adaptive-voter scheme is 
superior to the ordinary voter; however, there are design issues that must be 
resolved to realize an adaptive voter in practice. 
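
A small Python sketch shows how the test quantity of Eq. (4.63) behaves. The voting function follows the equation for binary outputs with a 0.5 threshold. The weight update is an assumption made for illustration: an exponential average of each element's agreement with the voter output, with hypothetical parameters `rate` and `floor`. The text does not prescribe a particular update rule.

```python
def adaptive_vote(outputs, weights, threshold=0.5):
    """Weighted vote of Eq. (4.63) for binary outputs x_i in {0, 1}."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    quantity = sum(a * x for a, x in zip(weights, outputs)) / total
    return 1 if quantity > threshold else 0

def update_weights(weights, outputs, voted, rate=0.1, floor=1e-6):
    """Hypothetical rule: exponential average of agreement with the voter."""
    updated = []
    for a, x in zip(weights, outputs):
        agreement = 1.0 if x == voted else 0.0
        updated.append(max(floor, (1 - rate) * a + rate * agreement))
    return updated

# Element 3 is stuck at 0 while the correct output is 1.
weights = [1.0, 1.0, 1.0]
for _ in range(50):
    outputs = [1, 1, 0]
    voted = adaptive_vote(outputs, weights)
    weights = update_weights(weights, outputs, voted)

print([round(a, 4) for a in weights])
```

After repeated disagreement the weight of the faulty element decays toward the floor, which approximates disconnecting it, and it recovers if the element starts agreeing again, which suits transient faults.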

The reader will appreciate that there are many choices for an adjudicator 
algorithm that yield an associated set of architectures. However, cost, volume, 
weight, and simplicity considerations generally limit the choices to a few of 
the simpler configurations. For example, when majority voting is used, it is 
generally limited to TMR or, in the case of the Space Shuttle example, 4-level 
voting with lockout. The most complex arrangement the author can remember 
is a 5-level majority logic system used to control the Apollo Mission’s main 
Saturn engine. For the Space Shuttle and Carousel navigation system exam¬ 
ples, the astronauts/pilots had access to other information, such as previous 
problems with individual equipment and ground-based measurements or
observations. Thus the accessibility of individual outputs and possible tests
allows human operators to organize a wide variety of behaviors. Presently,
commercial airliners are switching from inertial navigation systems to navigation using
the satellite-based Global Positioning System (GPS). Handheld GPS receivers 
have dropped in price to the $100-$200 range, so one can imagine every airline 
pilot keeping one in his or her flight bag as a backup. A similar trend occurred 
in the 1780s when pocket chronometers dropped in price to less than £65. 
Ship captains of the East India Company as well as those of the Royal Navy 
(who paid out of their own pockets) eagerly bought these accurate watches to 
calculate longitude while at sea [Sobel, 1995, p. 162].

PROBLEMS

4.1. Derive the equation analogous to Eq. (4.9) for a four-element majority 
voting scheme.

4.2. Derive the equation analogous to Eq. (4.9) for a five-element majority
voting scheme. 

4.3. Verify the reliability functions sketched in Fig. 4.2. 

4.4. Compute the reliability of a 3-level majority voting system for the case 
where the failure rate is constant, $\lambda = 10^{-4}$ failures per hour, and t =
1,000 hours. Compare this with the reliability of a single system. 

4.5. Repeat problem 4.4 for a 5-level majority voting system. 

4.6. Compare the results of problem 4.4 with a single system: two elements 
in parallel, two elements in standby. 

4.7. Compare the results of problem 4.5 with a single system: two elements 
in parallel, two elements in standby. 

4.8. What should the reliability of the voter be if it increases the probability 
of failure of the system of problem 4.4 by 10%? 

4.9. Compute the reliability at t = 1,000 hours of a system composed of a 
series connection of module 1 and module 2, each with a constant failure 
rate of $\lambda_i = 0.5 \times 10^{-4}$ failures per hour. If we design a 3-level majority
voting system that votes on the outputs of module 2, we have the same 
system as in problem 4.4. However, if we vote at the outputs of modules 
1 and 2, we have an improved system. Compute the reliability of this 
system and compare it with problem 4.4. 

4.10. Expand the reliability functions in series in the high-reliability region for 
the TMR 3-2-1 system and the TMR 3-2 system for the three systems 
of Fig. 4.3. [Include more terms than in Eqs. (4.14)-(4.16).]

4.11. Compute the MTTF for the expansions of problem 4.10, compare these 
with the exact MTTF for these systems, and comment. 

4.12. Verify that an expansion of Eqs. (4.3a, b) leads to seven terms in addition 
to the term one, and that this leads to Eqs. (4.5a, b) and (4.6a, b). 

4.13. The approximations used in plotting Fig. 4.3 are less accurate for the 
larger values of $\lambda t$. Recompute the values using the exact expressions
and comment on the accuracy of the approximations. 

4.14. Inspection of Fig. 4.4 shows that N-modular redundancy is of no
advantage over a single unit at t = 0 (they both have a reliability of 1) and at
$\lambda t = 0.69$ (they both have a reliability of 0.5). The maximum advantage
of N-modular redundancy is realized somewhere in between them. Compute the
ratio of the N-modular redundancy reliability given by Eq. (4.17) divided
by the reliability of a single system that equals p. Maximize (i.e.,
differentiate this ratio with respect to p and set equal to 0) to solve for the
value of p that gives the biggest improvement in reliability. Since
$p = e^{-\lambda t}$, what is the value of $\lambda t$ that corresponds to the
optimum value of p?

4.15. Repeat problem 4.14 for the case of component redundancy and majority
voting as shown in Fig. 4.5 by using the reliability equation given in Eq. 
(4.18). 

4.16. Verify Grisamone’s results given in Table 4.1. 

4.17. Develop a reliability expression for the system of Fig. 4.8 assuming that
(1) all circuits $A_i$, $B_i$, $C_i$, and the voters $V_i$ are independent
circuits or independent integrated circuit chips.

4.18. Develop a reliability expression for the system of Fig. 4.8 assuming that
(2) all circuits $A_i$, $B_i$, and $C_i$ are independent circuits or independent
integrated circuit chips and the voters $V_i$, $V'$, and $V''$ are all on the
same chip.

4.19. Develop a reliability expression for the system of Fig. 4.8 assuming that
(3) all voters $V_i$, $V'$, and $V''$ are independent circuits or independent
integrated circuit chips and circuits $A_i$, $B_i$, and $C_i$ are all on the
same chip.

4.20. Develop a reliability expression for the system of Fig. 4.8 assuming that
(4) all circuits $A_i$, $B_i$, and $C_i$ and all voters $V_i$, $V'$, and $V''$
are all on the same chip.

4.21. Section 4.5.3 discusses the difference between various failure models. 
Compare the reliability of a 1-bit TMR system under the following fail¬ 
ure model assumptions: 

(a) The failures are always s-a-1. 

(b) The failures are always s-a-0. 

(c) The circuits fail so that they always give the complement of the 
correct output. 

(d) The circuits fail at a transient rate $\lambda_t$ and produce the complement
of the correct output. 

4.22. Repeat problem 4.21, but instead of calculating the reliability, calculate 
the probability that any one transmission is in error. 

4.23. The circuit of Fig. 4.10 for a 32-bit word leads to a 512-gate circuit 
as described in this chapter. Using the information in Fig. B7, calculate 
the reliability of the voter and warning circuit. Using Eq. (4.19) and 
assuming that the voter reliability decreases the system reliability to 90% 
of what would be achieved with a perfect voter, calculate $p_c$. Again using
Fig. B7, calculate the equivalent gate complexity of the digital circuit in 
the TMR scheme. 

4.24. Repeat problem 4.10 for an /-level voting system. 

4.25. Derive a set of Markov equations for the model given in Fig. 4.11 and
show that the solution of each equation leads to Eqs. (4.25a-c).

4.26. Formulate a four-state model related to Fig. 4.11, as discussed in the
text, where the component states two failures and three failures are not 
merged but are distinct. Solve the model for the four-state probabilities 
and show that the first two states are identical with Eqs. (4.25a, b) and 
that the sum of the third and fourth states equals Eq. (4.25c). 

4.27. Compare the effects of repair on TMR reliability by plotting Eq. (4.27e), 
including the third term, with Eq. (4.27d). Both equations are to be plotted
versus time for the cases where $\mu = 10\lambda$, $\mu = 25\lambda$, and $\mu = 100\lambda$.

4.28. Over what time range will the graphs in the previous problem be valid? 
(Hint: When will the next terms in the series become significant?)

4.29. The logic function for a voter was simplified in Eq. (4.23) and Table 4.5. 
Suppose that all four minterms given in Table 4.5 were included without 
simplification, which provides some redundancy. Compare the reliability 
of the unminimized voter with the minimized voter (cf. Shooman [1990, 
p. 324]). 

4.30. Make a model for coupler reliability and for a TMR voter. Compare the 
reliability of two elements in parallel with that for a TMR. 

4.31. Repeat problem 4.30 when both systems include repair. 

4.32. Compare the MTTF of the systems in Table 3.4 with TMR and 5MR 
voter systems. 

4.33. Repeat problem 4.32 for Table 3.5. 

4.34. Compute the initial reliability for the systems of Tables 3.4 and 3.5 and 
compare with TMR and 5MR voter systems. 

4.35. Sketch and compare the initial reliabilities of TMR and 5MR, Eqs.
(4.27d) and (4.39b). Both equations are to be plotted versus time for
the cases where $\mu = 0$, $\mu = 10\lambda$, $\mu = 25\lambda$, and
$\mu = 100\lambda$. Note that for $\mu = 100\lambda$ and for points where the
reliability has decreased to 0.99 or
0.95, the series approximations may need additional terms. 

4.36. Check the values in Table 4.6. 

4.37. Check the series expansions and the values in Table 4.7. 

4.38. Plot the initial reliability of the four systems in Table 4.7. Calculate the 
next term in the series expansion and evaluate the time at which it rep¬ 
resents a 10% correction in the unreliability. Draw a vertical bar on the 
curve at this point. Repeat for each of the systems yielding a comparison 
of the reliabilities and a range of validity of the series expressions. 

4.39. Compare the voter circuit and reliability of (a) a TMR system, (b) a 
5MR system, and (c) five parallel elements with a coupler. Assume the 
voters and the coupler are imperfect. Compute and plot the reliability.

4.40. What time interval will be needed before the repair terms in the
comparison made in problem 4.39 become significant?

4.41. It is assumed that a standby element cannot fail when it is in standby. 
However, this is not always true for many reasons; for example, batter¬ 
ies discharge in standby, corrosion can occur, and insulation can break 
down, all of which may significantly change the comparison. How large
can the standby failure rate be and still be ignored? 

4.42. The reliability of the coupling device in a standby or parallel system is 
more complex than the voter reliability in a TMR circuit. These effects 
on availability may be significant. How large can the coupling failure 
rate be and still be ignored? 

4.43. Repair in any of these systems is predicated on knowing when a system
has failed. In the case of TMR, we gave a simple logic circuit that would 
detect which element has failed. What is the equivalent detection circuit 
in the case of a parallel or standby system and what are the effects? 

4.44. Check the values in Table 4.9. 

4.45. Check the values in Table 4.10. 

4.46. Add another line to Table 4.10 for 5-level modular redundancy. 

4.47. Check the computations given in Tables 4.11 and 4.12. 

4.48. Determine the range of p for which the various systems in Table 4.11 
are superior to a single element. 

4.49. Determine the range of p for which the various systems in Table 4.12 
are superior to a single element. 

4.50. Explain how a system based on the adaptive voting algorithm of Eq. 
(4.63) will operate if 50% of all failures are transient and clear in a 
short period of time. 

4.51. Explain how a system based on the adaptive voting algorithm of Eq. 
(4.63) will operate if it is basically a TMR system and 50% of all element 
one failures are transient and 25% of all elements two and three failures 
are transient. 

4.52. Repeat and verify the availability computations in the last paragraph of 
Section 4.9.2. 

4.53. Compute the auto availability of a two-car family in which both the
husband and wife need a car every day. Repeat the computation if a single
car will serve the family in a pinch while the other car gets repaired. 
(See the brief discussion of auto reliability in Section 3.10.1 for failure 
and repair rates.) 

4.54. At the end of Section 4.9.2 before the final numerical example, three
factors not included in the model were listed. Discuss how you would
model these effects for a more complex Markov model. 

4.55. Can you suggest any approximate procedures to determine if any of the 
effects in problem 4.54 are significant? 

4.56. Repeat problem 4.39 for the system availability. Make approximations 
where necessary. 

4.57. Repeat problem 4.30 for system availability. 

4.58. Repeat the derivation of Eq. (4.26c). 

4.59. Repeat the derivation of Eq. (4.37). 

4.60. Check the values given in Table 4.9. 

4.61. Derive Eq. (4.59).

SOFTWARE RELIABILITY AND
RECOVERY TECHNIQUES 


5.1 INTRODUCTION 

The general approach in this book is to treat reliability as a system problem
and to decompose the system into a hierarchy of related subsystems or
components. The reliability of the entire system is related to the reliability
of the components by some sort of structure function in which the components
may fail independently or in a dependent manner. The discussion that follows
will make it abundantly clear that software is a major “component” of the
system reliability,¹ $R$. The reason that a separate chapter is devoted to
software reliability is that the probabilistic models used for software differ
from those used for hardware; moreover, hardware and software (and human)
reliability can be combined only at a very high system level. (Section 5.8.5
discusses a macro-software reliability model that allows hardware and software
to be combined at a lower level.) Specifically, if the hardware, software, and
human failures are independent (often, this is not the case), one can express
the system reliability, $R_{SY}$, as the product of the hardware reliability,
$R_H$, the software reliability, $R_S$, and the human operator reliability,
$R_O$. Thus, if independence holds, one can model the reliability of the
various factors separately and combine them:
$R_{SY} = R_H \times R_S \times R_O$ [Shooman, 1983, pp. 351-353].
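
As a small numerical illustration of the product rule, the following Python
sketch multiplies three reliabilities. The function name and the sample values
0.99, 0.95, and 0.98 are assumptions chosen only for illustration; they do not
come from the text, and the result is meaningful only if the three failure
sources are independent.

```python
def system_reliability(r_hardware, r_software, r_operator):
    """Product of the three reliabilities, valid only under independence.

    Hypothetical helper for illustration; not a function from the book.
    """
    for r in (r_hardware, r_software, r_operator):
        if not 0.0 <= r <= 1.0:
            raise ValueError("each reliability must lie in [0, 1]")
    return r_hardware * r_software * r_operator


# Illustrative (assumed) values, not data from the text.
r_sy = system_reliability(0.99, 0.95, 0.98)
print(f"R_SY = {r_sy:.4f}")
```

Because each factor is at most 1, the product can never exceed the smallest of
the three values, so the weakest of hardware, software, and operator bounds
the reliability of the whole system.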

This chapter will develop models that can be used for the software
reliability. These models are built upon the principles of continuous random
variables developed in Appendix A, Sections A6 and A7, and Appendix B,
Section B3; the reader may wish to review these concepts while reading this
chapter.

¹Another important “component” of system reliability is human reliability if
an operator is involved in any control, monitoring, input, or similar task. A
discussion of human reliability models is beyond the scope of this book; the
reader is referred to Dougherty and Fragola [1988],

Clearly every system that involves a digital computer also includes a
significant amount of software used to control system operation. It is hard to
think of a modern business system, such as that used for information,
transportation, communication, or government, that is not heavily
computer-dependent. The microelectronics revolution has produced
microprocessors and memory chips that are so cheap and powerful that they can
be included in many commercial products. For example, a 1999 luxury car model
contained 20-40 microprocessors (depending on which options were installed),
and several models used local area networks to channel the data between
sensors, microprocessors, displays, and target devices [New York Times,
August 27, 1998]. Consumer products such as telephones, washing machines, and
microwave ovens use a huge number of embedded microcomponents. In 1997, 100
million microprocessors were sold, but this was eclipsed by the sale of 4.6
billion embedded microcomponents. Associated with each microprocessor or
microcomponent is memory, a set of instructions, and a set of programs
[Pollack, 1999].

5.1.1 Definition of Software Reliability 

One can define software engineering as the body of engineering and management
technologies used to develop quality, cost-effective, schedule-meeting
software. Software reliability measurement and estimation is one such
technology that can be defined as the measurement and prediction of the
probability that the software will perform its intended function (according to
specifications) without error for a given period of time. Oftentimes, the
design, programming, and testing techniques that contribute to high software
reliability are included; however, we consider these techniques as part of the
design process for the development of reliable software. Software reliability
complements reliable software; both, in fact, are important topics within the
discipline of software engineering. Software recovery is a set of fail-safe
design techniques for ensuring that if some serious error should crash the
program, the computer will automatically recover to reinitialize and restart
its program. Software recovery succeeds if no crucial data is lost and no
operational calamity occurs, so that the recovery transforms a total failure
into a benign or, at most, a troubling but nonfatal “hiccup.”

5.1.2 Probabilistic Nature of Software Reliability 

On first consideration, it seems that the outcome of a computer program is 
a deterministic rather than a probabilistic event. Thus one might say that the 
output of a computer program is not a random result. In defining the concept 
of a random variable, Cramer [Chapter 13, 1991] talks about spinning a coin as 
an experiment and the outcome (heads or tails) as the event. If we can control 
all aspects of the spinning and repeat it each time, the result will always be 
the same; however, such control needs to be so precise that it is practically
impossible to repeat the experiment in an identical manner. Thus the event
(heads or tails) is a random variable. The remainder of this section develops 
a similar argument for software reliability where the random element in the 
software is the changing set of inputs. 

Our discussion of the probabilistic nature of software begins with an
example. Suppose that we write a computer program to solve for the roots $r_1$
and $r_2$ of a quadratic equation, $Ax^2 + Bx + C = 0$. If we enter the values
1, 5, and 6 for $A$, $B$, and $C$, respectively, the roots will be $r_1 = -2$
and $r_2 = -3$. A single test of the software with these inputs confirms the
expected results. Exact repetition of this experiment with the same values of
$A$, $B$, and $C$ will always yield the same results, $r_1 = -2$ and
$r_2 = -3$, unless there is a hardware failure or an operating system problem.
Thus, in the case of this computer program, we have defined a deterministic
experiment. No matter how many times we repeat the computation with the same
values of $A$, $B$, and $C$, we obtain the same result (assuming we exclude
outside influences such as power failures, hardware problems, or operating
system crashes unrelated to the present program). Of course, the real problem
here is that after the first computation of $r_1 = -2$ and $r_2 = -3$ we do no
useful work by repeating the identical computation. To do useful work, we must
vary the values of $A$, $B$, and $C$ and compute the roots for other input
values. Thus the probabilistic nature of the experiment, that is, the
correctness of the values obtained from the program for $r_1$ and $r_2$, is
dependent on the input values $A$, $B$, and $C$ in addition to the correctness
of the computer program for this particular set of inputs.
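
The argument can be made concrete with a minimal sketch. The solver below is
a deliberately incomplete, hypothetical implementation (it is not taken from
the text): it applies the quadratic formula with real arithmetic and nothing
else. For any fixed input the outcome never changes, but when the inputs are
drawn at random, the fraction of runs that fail behaves like a probability.
The integer range of -1,000 to +1,000, the trial count, and the seed are
arbitrary assumptions.

```python
import math
import random


def naive_roots(a, b, c):
    """Quadratic formula only; no special handling of any input case."""
    d = math.sqrt(b * b - 4 * a * c)      # ValueError if discriminant < 0
    return ((-b + d) / (2 * a), (-b - d) / (2 * a))  # ZeroDivisionError if a == 0


def fails(a, b, c):
    """True if the naive solver cannot produce an answer for this input."""
    try:
        naive_roots(a, b, c)
        return False
    except (ValueError, ZeroDivisionError):
        return True


def estimated_failure_fraction(trials=10_000, seed=1):
    """Fraction of randomly drawn integer inputs on which the solver fails."""
    rng = random.Random(seed)
    failures = 0
    for _ in range(trials):
        a, b, c = (rng.randint(-1000, 1000) for _ in range(3))
        failures += fails(a, b, c)
    return failures / trials


print(naive_roots(1, 5, 6))            # same answer on every repetition
print(estimated_failure_fraction())    # depends on the input distribution
```

The estimate changes if the distribution of inputs changes, which is the
sense in which the reliability of a deterministic program depends on how it
is used.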

The reader can readily appreciate that when we vary the values of $A$, $B$,
and $C$ over the range of possible values, either during test or operation, we
would soon see if the software developer achieved an error-free program. For
example, was the developer wise enough to treat the problem of imaginary
roots? Did the developer use the quadratic formula to solve for the roots?
How, then, was the case of $A = 0$ treated where there is only one root and
the quadratic formula “blows up” (i.e., leads to an exponential overflow
error)? Clearly, we should test for all these values during development to
ensure that there are no residual errors in the program, regardless of the
input value. This leads to the concept of exhaustive testing, which is always
infeasible in a practical problem. Suppose in the quadratic equation example
that the values of $A$, $B$, and $C$ were restricted to integers between
+1,000 and -1,000. Thus there would be about 2,000 values of $A$ and a like
number of values of $B$ and $C$. The possible input space for $A$, $B$, and
$C$ would therefore be $(2{,}000)^3 = 8$ billion values.² Suppose that we
solve for the roots in each case, substitute in the original equation to
check, and only print out a result if the roots when substituted do not yield
a zero of the equation. If we could process 1,000 values per minute, the
exhaustive test would require 8 million minutes, which is 5,556 days or 15.2
years. This is hardly a feasible procedure: any such computation for a
practical problem involves a much larger test space and a more difficult
checking procedure that is impossible in any practical sense. In the quadratic
equation example, there was a ready means of checking the answers by
substitution into the equation; however, if the purpose of the program is to
calculate satellite orbits, and if 1 million combinations of input parameters
are possible, then one or more persons or a computer must independently obtain
the 1 million right answers and check them all! Thus the probabilistic nature
of software reliability is based on the varying values of the input, the huge
number of input cases, the initial system states, and the impossibility of
exhaustive testing.

²In a real-time system, each set of input values enters when the computer is
in a different “initial state,” and all the initial states must also be
considered. Suppose that a program is designed to sum the values of the inputs
for a given period of time, print the sum, and reset. If there is a high
partial sum, and a set of inputs occurs with large values, overflow may be
encountered. If the partial sum were smaller, this same set of inputs would
cause no problems. Thus, in the general case, one must consider the input
space to include all the various combinations of inputs and states of the
system.
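
The arithmetic of the exhaustive test, and the substitution check that makes
the quadratic example unusually easy to verify, can be sketched as follows.
This is a minimal illustration under stated assumptions: the function names
and the tolerance are hypothetical, the solver shows one possible treatment of
the A = 0 and imaginary-root cases rather than the method of any particular
program, and a 365.25-day year is assumed.

```python
import cmath


def solve_quadratic(a, b, c):
    """Return the list of roots of a*x**2 + b*x + c = 0."""
    if a == 0:
        if b == 0:
            raise ValueError("A and B are both zero: no equation in x")
        return [-c / b]                    # linear case: exactly one root
    d = cmath.sqrt(b * b - 4 * a * c)      # complex sqrt covers imaginary roots
    return [(-b + d) / (2 * a), (-b - d) / (2 * a)]


def roots_check(a, b, c, tol=1e-9):
    """Substitute every root back into the equation (tol is an assumption)."""
    return all(abs(a * r * r + b * r + c) <= tol
               for r in solve_quadratic(a, b, c))


def exhaustive_test_time(values_per_input=2_000, cases_per_minute=1_000):
    """Number of cases, days, and years needed to test every (A, B, C)."""
    cases = values_per_input ** 3
    days = cases / cases_per_minute / (60 * 24)
    return cases, days, days / 365.25


print(solve_quadratic(1, 5, 6), roots_check(1, 2, 5))
cases, days, years = exhaustive_test_time()
print(f"{cases:,} cases, {days:,.0f} days, {years:.1f} years")
```

With the figures used in the text (2,000 values per coefficient and 1,000
cases per minute), the calculation reproduces the 8 billion cases and roughly
15.2 years quoted above.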

The basis for software unreliability is quite different from the most common
causes of hardware unreliability. Software development is quite different from
hardware development, and the source of software errors (random discovery
of latent design and coding defects) differs from the source of most hardware
errors (equipment failures). Of course, some complex hardware does have
latent design and assembly defects, but the dominant mode of hardware failure
is equipment failure. Mechanical hardware can jam, break, and become
worn-out, and electrical hardware can burn out, leaving a short or open circuit
or some other mode of failure. Many who criticize probabilistic modeling of
software complain that instructions do not wear out. Although this is a true
statement, the random discovery of latent software defects is indeed just as
damaging as equipment failures, even though it constitutes a different mode
of failure.

The development of models for software reliability in this chapter begins 
with a study of the software development process in Section 5.3 and continues 
with the formulation of probabilistic models in Section 5.4. 


5.2 THE MAGNITUDE OF THE PROBLEM 

Modeling, predicting, and measuring software reliability is an important
quantitative approach to achieving high-quality software and growth in
reliability as a project progresses. It is an important management and
engineering design metric; most software errors are at least troublesome (some
are very serious), so the major flaws, once detected, must be removed by
localization, redesign, and retest.

The seriousness and cost of fixing some software problems can be appreciated
if we examine the Year 2000 Problem (Y2K). The largely overrated fears
occurred because during the early days of the computer revolution in the 1960s
and 1970s, computer memory was so expensive that programmers used many
tricks and shortcuts to save a little here and there to make their programs
operate with smaller memory sizes. In 1965, magnetic-core computer memory was
expensive, at about $1 per word, and used a significant operating current.
(Presently, microelectronic memory sells for perhaps $1 per megabyte and draws
only a small amount of current; assuming a 16-bit word, this cost has
therefore been reduced by a factor of about 500,000!) To save memory,
programmers reserved only 2 digits for the year, storing just its last 2
digits. They did not anticipate that any of their programs would survive for
more than 5-10 years; moreover, they did not contemplate the problem that for
the year 2000, the digits “00” could instead represent the year 1900 in the
software. The simplest solution was to replace the 2-digit year field with a
4-digit one. The problem was the vast amount of time required not only to
search for the numerous instances in which the year was used as input or
output data or used in intermediate calculations in existing software, but
also to test that the changes had been successful and had not introduced any
new errors. This problem was further exacerbated because many of these older
software programs were poorly documented, and in many cases they were
translated from one version to another or from one language to another so they
could be used in modern computers without the need to be rewritten. Although
only minor problems occurred at the start of the new century, hundreds of
millions of dollars had been expended to make a few changes that would have
been trivial if the software programs had been originally designed to prevent
the Y2K problem.
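
The failure mechanism is easy to reproduce. The sketch below is a
hypothetical illustration (the function names and the 1965 start year are
assumptions, not taken from any real legacy program): an elapsed-time
calculation that stores only the last 2 digits of the year works through 1999
and breaks when the stored year wraps to 00, whereas the same calculation on
4-digit years does not.

```python
def years_elapsed_2digit(start_yy, current_yy):
    """Elapsed years when only the last 2 digits of each year are stored."""
    return current_yy - start_yy


def years_elapsed_4digit(start_year, current_year):
    """Elapsed years when the full 4-digit year is stored."""
    return current_year - start_year


for year in (1999, 2000):
    legacy = years_elapsed_2digit(65, year % 100)   # 1965 stored as "65"
    fixed = years_elapsed_4digit(1965, year)
    print(year, legacy, fixed)
```

In 1999 both versions give 34 years; in 2000 the 2-digit version gives -65
while the 4-digit version gives 35. The repair itself is a one-line change,
which is why the real cost lay in finding every such calculation and retesting
it.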

Sometimes, however, efforts to avert Y2K software problems created problems
themselves. One such case was that of the 7-Eleven convenience store
chain. On January 1, 2001, the point-of-sale system used in the 7-Eleven stores 
read the year “2001” as “1901,” which caused it to reject credit cards if they 
were used for automatic purchases (manual credit card purchases, in addition 
to cash and check purchases, were not affected). The problem was attributed 
to the system’s software, even though it had been designed for the 5,200-store 
chain to be Y2K-compliant, had been subjected to 10,000 tests, and worked fine 
during 2000. (The chain spent 8.8 million dollars—0.1% of annual sales—for 
Y2K preparation from 1999 to 2000.) Fortunately, the bug was fixed within 1 
day [The Associated Press, January 4, 2001]. 

Another case was that of Norway’s national railway system. On the morning
of December 31, 2000, none of the 16 new airport-express trains or 13
high-speed signature trains would start. Although the computer software had
been checked thoroughly before the start of 2000, it still failed to recognize
the correct date. The software was reset to read December 1, 2000, to give the
German maker of the new trains 30 days to correct the problem. None of the
older trains were affected by the problem [New York Times, January 3, 2001].

Before we leave the obvious aspects of the Y2K problem, we should consider
how deeply entrenched some of these problems were in legacy software: old
programs that are used in their original form or rejuvenated for extended
use. Analysts have found that some of the old IBM 9020 computers used in
outmoded components of air traffic control systems contain an algorithm in
their microcode for switching between the two redundant cooling pumps each
month to even the wear. (For a discussion of cooling pumps in typical IBM
computers, see Siewiorek [1992, pp. 493, 504].) Nobody seemed to know how this
calendar-sensitive algorithm would behave in the year 2000! The engineers and
programmers who wrote the microcode for the 9020s had retired before 2000, and
the obvious answer (replace the 9020s with modern computers) proceeded slowly
because of the cost. Although no major problems occurred, the scare did bring
to the attention of many managers the potential problems associated with the
use of legacy software.

Software development is a lengthy, complex process, and before the focus of 
this chapter shifts to model building, the development process must be studied. 


5.3 SOFTWARE DEVELOPMENT LIFE CYCLE 

Our goal is to make a probabilistic model for software, and the first step in
any modeling is to understand the process [Boehm, 2000; Brooks, 1995;
Pfleeger, 1998; Schach, 1999; and Shooman, 1983]. A good approach to the study
of the software development process is to define and discuss the various
phases of the software development life cycle. A common partitioning of these
phases is shown in Table 5.1. The life cycle phases given in this table apply
directly to the technique of program design known as structured procedural
programming (SPP). In general, they also apply with some modification to the
newer approach known as object-oriented programming (OOP). The details of OOP,
including the popular design diagrams used for OOP that are called the Unified
Modeling Language (UML), are beyond the scope of this chapter; the reader is
referred to the following references for more information: [Booch, 1999;
Fowler, 1999; Pfleeger, 1998; Pooley, 1999; Pressman, 1997; and Schach, 1999].
The remainder of this section focuses on the SPP design technique.

5.3.1 Beginning and End 

The beginning and end of the software development life cycle are the start
of the project and the discard of the software. The start of a project is
generally driven by some event; for example, the head of the Federal Aviation
Administration (FAA) or of some congressional committee decides that the
United States needs a new air traffic control system, or the director of
marketing in a company proposes to a management committee that to keep the
company’s competitive edge, it must develop a new database system. Sometimes,
a project starts with a written needs document, which could be an internal
memorandum, a long-range plan, or a study of needed improvements in a
particular field. The necessity is sometimes a business expansion or
evolution; for example, a company buys a new subsidiary business and finds
that its old payroll program will not support the new conglomeration,
requiring an updated payroll program. The needs document generally specifies
why new software is needed. Generally, old software is discarded once new,
improved software is available. However, if one branch of an organization
decides to buy new software and another branch wishes to continue with its
present version, it may be difficult to define the end of the software’s
usage. Oftentimes, the discarding takes place many years beyond what was
originally envisioned when the software was developed or purchased. (In many
ways, this is why there was a Y2K problem: too few people ever thought that
their software would last to the year 2000.)

TABLE 5.1 Project Phases for the Software Development Life Cycle

Phase                       Description
--------------------------  ---------------------------------------------------
Start of project            Initial decision or motivation for the project,
                            including overall system parameters.
Needs                       A study and statement of the need for the software
                            and what it should accomplish.
Requirements                Algorithms or functions that must be performed,
                            including functional parameters.
Specifications              Details of how the tasks and functions are to be
                            performed.
Design of prototype         Construction of a prototype, including coding and
                            testing.
Prototype: System test      Evaluation by both the developer and the customer
                            of how well the prototype design meets the
                            requirements.
Revision of specifications  Prototype system tests and other information may
                            reveal needed changes.
Final design                Design changes in the prototype software in
                            response to discovered deviations from the original
                            specifications or the revised specifications, and
                            changes to improve performance and reliability.
Code final design           The final implementation of the design.
Unit test                   Each major unit (module) of the code is
                            individually tested.
Integration test            Each module is successively inserted into the
                            pretested control structure, and the composite is
                            tested.
System test                 Once all (or most) of the units have been
                            integrated, the system operation is tested.
Acceptance test             The customer designs and witnesses a test of the
                            system to see if it meets the requirements.
Field deployment            The software is placed into operational use.
Field maintenance           Errors found during operation must be fixed.
Redesign of the system      A new contract is negotiated after a number of
                            years of operation to include changes and
                            additional features. The aforementioned phases are
                            repeated.
Software discard            Eventually, the software is no longer updated or
                            corrected but discarded, perhaps to be replaced by
                            new software.



5.3.2 Requirements 

The project formally begins with the drafting of a requirements document for 
the system in response to the needs document or equivalent document. Initially, 
the requirements constitute high-level system requirements encompassing both 
the hardware and software. In a large project, as the requirements document 
“matures,” it is expanded into separate hardware and software requirements; 
the requirements will specify what needs to be done. For an air traffic control
(ATC) system, the requirements would deal with the ATC centers that it
must serve, the present and expected future volume of traffic, the mix of
aircraft, the types of radar and displays used, and the interfaces to other ATC
centers and the aircraft. Present travel patterns, expected growth, and expected 
changes in aircraft, airport, and airline operational characteristics would also 
be reflected in the requirements. 


5.3.3 Specifications 

The project specifications start with the requirements and add the details of
how the software is to be designed to satisfy these requirements. Continuing
with our air traffic control system example, there would be a hardware
specifications document dealing with (a) what type of radar is used; (b) the
kinds of displays and display computers that are used; (c) the distributed
computers or microprocessors and memory systems; (d) the communications
equipment; (e) the power supplies; and (f) any networks that are needed for
the project. The software specifications document will delineate (a) what
tracking algorithm to use; (b) how the display information for the aircraft
will be handled; (c) how the system will calculate any potential collisions;
(d) how the information will be displayed; and (e) how the air traffic
controller will interact with both the system and the pilots. Also, the exact
nature of any required records of a technical, managerial, or legal nature
will be specified in detail, including how they will be computed and archived.
Particular projects often use names different from requirements and
specifications (e.g., system requirements versus software specifications and
high-level versus detailed specifications), but their content is essentially
the same. A combined hardware-software specification might be used on a small
project.

It is always a difficult task to define when requirements give way to
specifications, and in the practical world, some specifications are mixed in
the requirements document and some sections of the specifications document
actually seem like requirements. In any event, it is important that the why,
the what, and the how of the project be spelled out in a set of documents. The
completeness of the set of documents is more important than exactly how the
various ideas are partitioned between requirements and specifications.

Several researchers have outlined or developed experimental systems that
use a formal language to write the specifications. Doing so has introduced a
formalism and precision that is often lacking in specifications. Furthermore,
since the formal specification language would have a grammar, one could build
an automated specification checker. With some additional work, one could also
develop a simulator that would in some way synthetically execute the
specifications. Doing so would be very helpful in many ways for uncovering
missing specifications, incomplete specifications, and conflicting
specifications. Moreover, in a very simple way, it would serve as a
preliminary execution of the software. Unfortunately, however, such projects
are only in the experimental or prototype stages [Wing, 1990].

5.3.4 Prototypes 

Most innovative projects now begin with a prototype or rapid prototype phase.
The purpose of the prototype is multifaceted: developers have an opportunity
to try out their design ideas, the difficult parts of the project become
rapidly apparent, and there is an early (imperfect) working model that can be
shown to the customer to help identify errors of omission and commission in
the requirements and specification documents. In constructing the prototype,
an initial control structure (the main program coordinating all the parts) is
written and tested along with the interfaces to the various components
(subroutines and modules). The various components are further decomposed into
smaller subcomponents until the module level is reached, at which time
programming or coding at the module level begins. The nature of a module is
described in the paragraphs that follow.

A module is a block of code that performs a well-described function or 
procedure. The length of a module is a frequently debated issue. Initially, its 
length was defined as perhaps 50-200 source lines of code (SLOC). The SLOC 
length of a module is not absolute; it is based on the coder’s “intellectual span 
of control.” Since a program listing contains about 50 lines per page, this means that a
module would be 1-4 pages long. The reasoning behind this is that it would 
be difficult to read, analyze, and trace the control structures of a program that 
extend beyond a few pages and keep all the logic of the program in mind; 
hence the term intellectual span of control. The concept of a module, module 
interface, and rough bounds on module size are more directly applicable to an 
SPP approach than to that of an OOP; however, as with very large and complex 
modules, very large and complex objects are undesirable. 

Sometimes, the prototype progresses rapidly since old code from related 
projects can be used for the subroutines and modules, or a “first draft” of the 
software can be written even if some of the more complex features are left out. 
If the old code actually survives to the final version of the program, we speak 
of such code as reused code or legacy code, and if such reuse is significant, 
the development life cycle will be shortened somewhat and the cost will be 
reduced. Of course, the prototype code must be tested, and oftentimes when a
prototype is shown to the customer, the customer realizes that some features
are not what he or she wanted. It is important to ascertain this as early
as possible in the project so that revisions can be made in the specifications 
that will impact the final design. If these changes are delayed until late in
the project, they can involve major changes in the code as well as significant
redesign and extensive retesting of the software, for which large cost overruns 
and delays may be incurred. In some projects, the contracting is divided into 
two phases: delivery and evaluation of the prototype, followed by revisions 
in the requirements and specifications and a second contract for the delivered 
version of the software. Some managers complain that designing a prototype 
that is to be replaced by a final design is doing a job twice. Indeed it is;
however, it is the best way to develop a large, complex project. (See Chapter 11,
“Plan to Throw One Away,” of Brooks [1995].) The cost of the prototype is 
not so large if one considers that much of the prototype code (especially the 
control structure) can be modified and reused for the final design and that the 
prototype test cases can be reused in testing the final design. It is likely that 
the same manager who objects to the use of prototype software would heartily 
endorse the use of a prototype board (breadboard), a mechanical model, or 
a computer simulation to “work out the bugs” of a hardware design without 
realizing that the software prototype is the software analog of these well-tried 
hardware development techniques. 

Finally, we should remark that not all projects need a prototype phase.
Consider the design of a fourth payroll system for a customer. Assume that the
development organization specializes in payroll software and had developed 
the last three payroll systems for the customer. It is unlikely that a prototype 
would be required by either the customer or the developer. More likely, the 
developer would have some experts with considerable experience study the 
present system, study the new requirements, and ask many probing questions 
of the knowledgeable personnel at the customer’s site, after which they could 
write the specifications for the final software. However, this payroll example
is not the usual case; in most cases, prototype software is valuable
and should be considered.

5.3.5 Design 

Design really begins with the needs, requirements, and specifications
documents. Also, the design of a prototype system is a very important part of
the design process. For discussion purposes, however, we will refer to the
final design stage as program design. In the case of SPP, there are two basic
design approaches: top-down and bottom-up. The top-down process begins with
the complete system at level 0; then, it decomposes this into a number of
subsystems at level 1. This process continues to levels 2 and 3, then down to
level n where individual modules are encountered and coded as described in the
following section. Such a decomposition can be modeled by a hierarchy diagram
(H-diagram) such as that shown in Fig. 5.1(a). The diagram, which resembles an
inverted tree, may be modeled as a mathematical graph where each “box” in the
diagram represents a node in the graph and each line connecting the boxes
represents a branch in the graph. A node at level k (the predecessor) has
several successor nodes at level (k + 1) (sometimes, the terms ancestor and
descendant or parent and child are used). The graph has no loops (cycles), all
nodes are connected (you can traverse a sequence of branches from any node to
any other node), and the graph is undirected (one can traverse all branches in
either direction). Such a graph is called a tree (free tree) and is shown in
Fig. 5.1(b). For more details on trees, see Cormen [p. 91 ff.].

Figure 5.1 (a) An H-diagram depicting the high-level architecture of a program
to be used in designing the suspension system of a high-speed train, assuming
that the dynamics can be approximately modeled by a third-order system
(characteristic polynomial is a cubic); and (b) a graph corresponding to (a).

The example of the H-diagram given in Fig. 5.1 is for the top-level
architecture of a program to be used in the hypothetical design of the suspension
system for a high-speed train. It is assumed that the dynamics of the
suspension system can be approximated by a third-order differential equation and that
the stability of the suspension can be studied by plotting the variation in the
roots of the associated third-order characteristic polynomial
($Ax^3 + Bx^2 + Cx + D = 0$), which is a function of the various coefficients
A, B, C, and D. It is also assumed that the company already has a plotting
program (4.1) that is to be reused. The block (4.2) is to determine whether the
roots have any positive real parts, since this indicates instability. In a
different design, one could move the function 4.2 to 2.4. Thus the H-diagram
can be used to discuss differences in high-level design architecture of a
program. Of course, as one decomposes a problem, modules may appear at
different levels in the structure, so the H-diagram need not be as symmetrical
as that shown in Fig. 5.1.
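
As an illustration of what a block such as 4.2 might do, the following is a
minimal sketch in Python. It does not compute the roots; instead it applies the
Routh-Hurwitz conditions for a cubic, which is one standard way (not
necessarily the one intended for the train example) to decide whether all roots
lie strictly in the left half-plane. The function name cubic_is_stable is
hypothetical, and a root with exactly zero real part is reported as not stable.

```python
def cubic_is_stable(a, b, c, d):
    """Return True if every root of a*x**3 + b*x**2 + c*x + d = 0
    has a strictly negative real part (Routh-Hurwitz test for a cubic).

    Hypothetical helper for illustration; no roots are computed.
    """
    if a == 0:
        raise ValueError("not a cubic: leading coefficient is zero")
    if a < 0:
        # Multiplying the polynomial by -1 leaves its roots unchanged.
        a, b, c, d = -a, -b, -c, -d
    # With a > 0: all coefficients must be positive and b*c must exceed a*d.
    return b > 0 and c > 0 and d > 0 and b * c > a * d


# (x + 1)(x + 2)(x + 3) = x^3 + 6x^2 + 11x + 6: all roots negative.
stable_case = cubic_is_stable(1, 6, 11, 6)
# (x - 1)(x + 2)(x + 3) = x^3 + 4x^2 + x - 6: one positive root.
unstable_case = cubic_is_stable(1, 4, 1, -6)
```

In a design study one would call such a check repeatedly while sweeping the
coefficients A, B, C, and D, and pass the resulting root data to the plotting
block (4.1).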

One feature of the top-down decomposition process is that the decision of 
how to design lower-level elements is delayed until that level is reached in 
the design decomposition and the final decision is delayed until coding of the 
respective modules begins. This hiding process, called information hiding, is 
beneficial, as it allows the designer to progress with his or her design while 
more information is gathered and design alternatives are explored before a 
commitment is made to a specific approach. If at each level k the project is 
decomposed into very many subproblems, then that level becomes cluttered 
with many concepts, at which point the tree becomes very wide. (The number 
of successor nodes in a tree is called the degree of the predecessor node.) If the 
decomposition only involves two or three subproblems (degree 2 or 3), the tree 
becomes very deep before all the modules are reached, which is again
cumbersome. A suitable value to pick for each decomposition is 5-9 subprograms
(each node should have degree 5-9). This is based on the work of the
experimental psychologist Miller [1956], who found that the classic human senses
(sight, smell, hearing, taste, and touch) could discriminate 5-9 logarithmic
levels. (See also Shooman [1983, pp. 194, 195].) Using the 5-9 decomposition
rule provides some bounds to the structure of the design process for an SPP. 

Assume that the program size is N source lines of code (SLOC) in length.
If the graph is symmetrical and all the modules appear at the lowest level k,
as shown in Fig. 5.1(a), and there are 5-9 successors at each node, then:

1. All the levels above k represent program interfaces.

2. At level 0, there are between $5^0 = 1$ and $9^0 = 1$ interfaces. At level 1,
the top-level node has between $5^1 = 5$ and $9^1 = 9$ interfaces. Also at level 2
are between $5^2 = 25$ and $9^2 = 81$ interfaces. Thus, for k levels starting
with level 0, the sum of the geometric progression
$r^0 + r^1 + r^2 + \cdots + r^{k-1}$ is given by the equations that follow.
(See Hall and Knight [1957, p. 39] or a similar handbook for more details.)

$$\text{Sum} = \frac{r^k - 1}{r - 1} \tag{5.1a}$$

and for r = 5 to 9, we have

$$\frac{5^k - 1}{4} \le \text{number of interfaces} \le \frac{9^k - 1}{8} \tag{5.1b}$$

3. The number of modules at the lowest level is given by

$$5^k \le \text{number of modules} \le 9^k \tag{5.1c}$$

4. If each module is of size M, the number of lines of code is

$$M \times 5^k \le \text{number of SLOC} \le M \times 9^k \tag{5.1d}$$

Since modules generally vary in size, Eq. (5.1d) is still approximately correct
if M is replaced by the average value $\bar{M}$.

We can better appreciate the use of Eqs. (5.1a-d) if we explore the following
example. Suppose that a module consists of 100 lines of code, in which case
M = 100, and it is estimated that a program design will take about 10,000
SLOC. Using Eqs. (5.1c, d), we know that the number of modules must be
about 100 and that the number of levels is bounded by $5^k = 100$ and
$9^k = 100$. Taking logarithms and solving the resulting equations yields
2.09 < k < 2.86. Thus, starting with the top-level 0, we will have about 2 or 3
successor levels. Similarly, we can bound the number of interfaces by
Eq. (5.1b), and substitution of k = 3 yields the number of interfaces between
31 and 91. Of course, these computations are for a symmetric graph; however,
they give us a rough idea of the size of the H-diagram design and the number of
modules and interfaces that must be designed and tested.
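
The arithmetic of this example can be reproduced with a short Python sketch.
The function names design_bounds and interface_count are hypothetical and are
introduced only for illustration; the sketch assumes the symmetric H-diagram of
Eqs. (5.1a-d) with between 5 and 9 successors per node.

```python
import math


def design_bounds(total_sloc, module_sloc, r_min=5, r_max=9):
    """Return (number of modules, lower bound on k, upper bound on k).

    Solves r_max**k = modules and r_min**k = modules for k, as in Eq. (5.1c).
    """
    modules = total_sloc / module_sloc
    k_low = math.log(modules) / math.log(r_max)
    k_high = math.log(modules) / math.log(r_min)
    return modules, k_low, k_high


def interface_count(r, k):
    """Sum r**0 + r**1 + ... + r**(k-1) from Eq. (5.1a), for integer r and k."""
    return (r**k - 1) // (r - 1)


modules, k_low, k_high = design_bounds(10_000, 100)  # about 100 modules
fewest = interface_count(5, 3)   # 31 interfaces for degree 5, k = 3
most = interface_count(9, 3)     # 91 interfaces for degree 9, k = 3
```

Here k_low is approximately 2.1 and k_high approximately 2.86, which agrees
with the bounds quoted in the text to within rounding.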

5.3.6 Coding 

Sometimes, a beginning undergraduate student feels that coding is the most
important part of developing software. Actually, it is only one of the sixteen
phases given in Table 5.1. Previous studies [Shooman, 1983, Table 5.1]
have shown that coding constitutes perhaps 20% of the total development
effort. The preceding phases of design (“start of project” through the “final
design”) entail about 40% of the development effort; the remaining phases,
starting with the unit (module) test, are another 40%. Thus coding is an
important part of the development process, but it does not represent a large
fraction of the cost of developing software. This is probably the first lesson
that the software engineering field teaches the beginning student.

The phases of software development that follow coding are various types of
testing. The design is an SPP, and the coding is assumed to follow the
structured programming approach where the minimal basic control structures are
as follows: IF THEN ELSE and DO WHILE. In addition, most languages also
provide DO UNTIL, DO CASE, BREAK, and PROCEDURE CALL AND
RETURN structures that are often called extended control structures. Prior to
the 1970s, the older, dangerous, and much-abused control structure GO TO
LABEL was often used indiscriminately and in a poorly thought-out manner.
One major thrust of structured programming was to outlaw the GO TO and
improve program structure. At the present, unless a programmer must correct,
modify, or adapt a very old (legacy) code, he or she should never or very
seldom encounter a GO TO. In a few specialized cases, however, an occasional
well-thought-out, carefully justified GO TO is warranted [Shooman, 1983].

Almost all modern languages support structured programming. Thus the 
choice of a language is based on other considerations, such as how familiar 
the programmers are with the language, whether there is legacy code available, 
how well the operating system supports the language, whether the code
modules are to be written so that they may be reused in the future, and so forth.
Typical choices are C, Ada, and Visual Basic. In the case of OOP, the most 
common languages at the present are C++ and Ada. 

5.3.7 Testing 

Testing is a complex process, and the exact nature of it depends on the design 
philosophy and the phase of the project. If the design has progressed under a 
top-down structured approach, it will be much like that outlined in Table 5.1. 
If the modern OOP techniques are employed, there may be more testing of 
interfaces, objects, and other structures within the OOP philosophy. If proof of 
program correctness is employed, there will be many additional layers added to 
the design process involving the writing of proofs to ensure that the design will 
satisfy a mathematical representation of the program logic. These additional 
phases of design may replace some of the testing phases. 

Assuming the top-down structured approach, the first step in testing the 
code is to perform unit (module) testing. In general, the first module to be 
written should be the main control structure of the program that contains the 
highest interface levels. This main program structure is coded and tested first. 
Since no additional code is generally present, sometimes “dummy” modules, 
called test stubs, are used to test the interfaces. If legacy code modules are 
available for use, clearly they can serve to test the interfaces. If a prototype 
is to be constructed first, it is possible that the main control structure will be 
designed well enough to be reused largely intact in the final version. 

Each functional unit of code is subjected to a test, called unit or module
testing, to determine whether it works correctly by itself. For example,
suppose that company X pays an employee a base weekly salary determined by the
employee’s number of years of service, number of previous incentive awards,
and number of hours worked in a week. The basic pay module in the payroll
program of the company would have as inputs the date of hire, the current
date, the number of hours worked in the previous week, and historical data
on the number of previous service awards, various deductions for withholding
tax, health insurance, and so on. The unit testing of this module would involve
formulating a number of hypothetical (or real) work records for a week plus a
number of hypothetical (or real) employees. The base pay would be computed
with pencil, paper, and calculator for these test cases. The data would serve
as inputs to the module, and the results (outputs) would be compared with the
precomputed results. Any discrepancies would be diagnosed, the internal cause
of the error (fault) would be located, and the code would be redesigned and
rewritten to correct the error. The tests would be repeated to verify that the
error had been eliminated.

If the first code unit to be tested is the program control
structure, it would define the software interfaces to other modules. In addition,
it would allow the next phase of software testing (the integration test) to
proceed as soon as a number of units had been coded and tested. During the
integration test, one or more units of code would be added to the control
structure (and any previous units that had been integrated), and functional
tests would be performed along a path through the program involving the new
unit(s) being tested. Generally, only one unit would be integrated at a time to
make localizing any errors easier, since they generally come from within the
new module of code; however, it is still possible for the error to be
associated with the other modules that had already completed the integration
test. The integration test would continue until all or most of the units have
been integrated into the maturing software system. Generally, module and many
integration test cases are constructed by examining the code. Such tests are
often called white box or clear box tests (the reason for these names will soon
be explained).
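
The unit-test procedure for the basic pay module can be sketched in Python as
follows. Everything specific here is an assumption made for illustration: the
pay rule inside base_weekly_pay, the function and variable names, and the test
records are all invented, since the text does not give company X's actual
formula. The point is only the pattern of comparing module outputs with
hand-computed expected values.

```python
from datetime import date


def base_weekly_pay(hire_date, today, hours_worked, awards):
    """Hypothetical pay rule, invented for illustration only:
    $20.00 per hour, plus $0.50 per completed year of service,
    plus $1.00 per previous incentive award."""
    years = (today - hire_date).days // 365
    hourly = 20.00 + 0.50 * years + 1.00 * awards
    return round(hourly * hours_worked, 2)


# Each record: (hire date, current date, hours, awards, hand-computed pay).
TEST_CASES = [
    (date(2020, 1, 6), date(2020, 1, 10), 40, 0, 800.00),  # new employee
    (date(2015, 3, 2), date(2020, 3, 6), 40, 2, 980.00),   # 5 years, 2 awards
    (date(2018, 6, 4), date(2020, 6, 5), 0, 1, 0.00),      # no hours worked
]


def run_unit_tests():
    """Return a list of discrepancies between actual and expected outputs."""
    failures = []
    for hire, today, hours, awards, expected in TEST_CASES:
        actual = base_weekly_pay(hire, today, hours, awards)
        if actual != expected:
            failures.append((hire, today, hours, awards, expected, actual))
    return failures


discrepancies = run_unit_tests()  # an empty list means every case passed
```

Keeping the test records in a table such as TEST_CASES also makes it easy to
rerun the same cases after a fault has been corrected.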

The system test follows the integration test. During the system test, a
scenario is written encompassing an entire operational task that the software
must perform. For example, in the case of air traffic control software, one
might write a scenario that replicates aircraft arrivals and departures at
Chicago’s O’Hare Airport during a slow period, say, between 11 and 12 P.M. This
would involve radar signals as inputs, the main computer and software for the
system, and one or more display processors. In some cases, the radar would not
be present, but simulated signals would be fed to the computer. (Anyone who
has seen the physical size of a large, modern radar can well appreciate why
the radar is not physically present, unless the system test is performed at an
air traffic control center, which is unlikely.) The display system is a
“desk-size” console likely to be present during the system test. As the system
test progresses, the software gradually approaches the time of release when it
can be placed into operation. Because most system tests are written based on the
requirements and specifications, they do not depend on the nature of the code;
they are as if the code were hidden from view in an opaque or black box.
Hence such functional tests are often called black box tests.

On large projects (and sometimes on smaller ones), the last phase of testing
is acceptance testing. This is generally written into the contract by the
customer. If the software is being written “in house,” an acceptance test would
be performed if the company software development procedures call for it. A
typical acceptance test would contain a number of operational scenarios
performed by the software on the intended hardware, where the location would be
chosen from (a) the developer’s site, (b) the customer’s site, or (c) the site
at which the system is to be deployed. In the case of air traffic control
(ATC), the ATC center contains the present on-line system n and the previous
system, n - 1, as a backup. If we call the new system n + 1, it would be
installed alongside n and n - 1 and operate on the same data as the on-line
system. Comparing the outputs of system n + 1 with system n for a number of
months would constitute a very good acceptance test. Generally, the criterion
for acceptance is that the software must operate on real or simulated system
data for a specified number of hours or be subjected to a certain number of
test inputs. If the acceptance test is passed, the software is accepted and the
developer is paid; however, if the test is failed, the developer resumes the
testing and correcting of software errors (including those found during the
acceptance test), and a new acceptance test date is scheduled.

Sometimes, “third party” testing is used, in which the customer hires an
outside organization to make up and administer integration, system, or acceptance
tests. The theory is that the developer is too close to his or her own work and 
cannot test and evaluate it in an unbiased manner. The third party test group 
is sometimes an independent organization within the developer’s company. Of 
course, one wonders how independent such an in-house group can be if it and 
the developers both work for the same boss. 

The term regression testing is often used, describing the need to retest the
software with the previous test cases after each new error is corrected. In
theory, one must repeat all the tests; however, a selected subset is generally
used in the retest. Each project requires a test plan to be written early in
the development cycle in parallel with or immediately following the completion
of specifications. The test plan documents the tests to be performed, organizes
the test cases by phase, and contains the expected outputs for the test cases.
Generally, testing costs and schedules are also included.

When a commercial software company is developing a product for sale to
the general business and home community, the later phases of testing are often
somewhat different, for which the terms alpha testing and beta testing are often
used. Alpha testing means that a test group within the company evaluates the
software before release, whereas beta testing means that a number of “selected
customers” with whom the developer works are given early releases of the
software to help test and debug it. Some people feel that beta testing is just a
way of reducing the cost of software development and that it is not a thorough
way of testing the software, whereas others feel that the company still does
adequate testing and that this is just a way of getting a lot of extra field
testing done in a short period of time at little additional cost.

During early field deployment, additional errors are found, since the actual
operating environment has features or inputs that cannot be simulated.
Generally, the developer is responsible for fixing the errors during early field
deployment. This responsibility is an incentive for the developer to do a
thorough job of testing before the software is released because fixing errors
after it is released could cost 25-100 times as much as that during the unit
test. Because of the high cost of such testing, the contract often includes a
warranty period (of perhaps 1-2 years or longer) during which the developer
agrees to fix any errors for a fee.

If the software is successful, after a period of years the developer and others 
will probably be asked to provide a proposal and estimate the cost of including 
additional features in the software. The winner of the competition receives a 
new contract for the added features. If during initial development the
developer can determine something about possible future additions, the design can
include the means of easily implementing these features in the future, a process 
for which the term “putting hooks” into the software is often used. Eventually, 
once no further added features are feasible or if the customer’s needs change 
significantly, the software is discarded. 

5.3.8 Diagrams Depicting the Development Process 

The preceding discussion assumed that the various phases of software
development proceed in a sequential fashion. Such a sequence is often called
waterfall development because of the appearance of the symbolic model as shown
in Fig. 5.2. This figure does not include a prototype phase; if this is added
to the development cycle, the diagram shown in Fig. 5.3 ensues. In actual
practice, portions of the system are sometimes developed and tested before the
remaining portions. The term software build is used to describe this process;
thus one speaks of build 4 being completed and integrated into the existing
system composed of builds 1-3. A diagram describing this build process, called
the incremental model of software development, is given in Fig. 5.4. Other
related models of software development are given in Schach [1999].

Now that the general features of the development process have been 
described, we are ready to introduce software reliability models related to the 
software development process. 

5.4 RELIABILITY THEORY 
5.4.1 Introduction 

In Section 5.1, software reliability was defined as the probability that the
software will perform its intended function, that is, the probability of success,
which is also known as the reliability. Since we will be using the principles
of reliability developed in Appendix B, Section B3, we summarize the
development of reliability theory that is used as a basis for our software
reliability models.


Figure 5.2 Diagram of the waterfall model of software development.


5.4.2 Reliability as a Probability of Success 

The reliability of a system (hardware, software, human, or a combination
thereof) is the probability of success, $P_s$, which is unity minus the
probability of failure, $P_f$. If we assume that t is the time of operation,
that the operation starts at t = 0, and that the time to failure is given by
$t_f$, we can then express the reliability as

$$R(t) = P_s = P(t_f > t) = 1 - P_f = 1 - P(0 \le t_f \le t) \tag{5.2}$$

Figure 5.3 Diagram of the rapid prototype model of software development.


The notation $P(0 \le t_f \le t)$ in Eq. (5.2) stands for the probability that
the time to failure is less than or equal to t. Of course, time is always a
positive value, so the time to failure is always equal to or greater than 0.
Reliability can also be expressed in terms of the cumulative probability
distribution function for the random variable time to failure, F(t), and the
probability density function, f(t) (see Appendix A, Section A6). The density
function is the derivative of the distribution function, $f(t) = dF(t)/dt$, and
the distribution function is the integral of the density function,
$F(t) = \int f(t)\,dt$. Since by definition $F(t) = P(0 \le t_f \le t)$,
Eq. (5.2) becomes

$$R(t) = 1 - F(t) = 1 - \int f(t)\,dt \tag{5.3}$$

Thus reliability can be easily calculated if the probability density function for
the time to failure is known. Equation (5.3) states the simple relationships
among R(t), F(t), and f(t); given any one of the functions, the other two are
easy to calculate.

Figure 5.4 Diagram of the incremental model of software development.

5.4.3 Failure-Rate (Hazard) Function

Equation (5.3) expresses reliability in terms of the traditional mathematical
probability functions, F(t) and f(t); however, reliability engineers have found
these functions to be generally ill-suited for study if we want intuition,
failure data interpretation, and mathematics to agree. Intuition suggests that
we study another function, a conditional probability function called the failure
rate (hazard), z(t). The following analysis develops an expression for the
reliability in terms of z(t) and relates z(t) to f(t) and F(t).

The probability density function can be interpreted from the following
relationship:

$$P(t < t_f < t + dt) = P(\text{failure in interval } t \text{ to } t + dt) = f(t)\,dt \tag{5.4}$$


One can relate the probability functions to failure data analysis if we begin with
N items placed on the life test at time t = 0. The number of items surviving the
life test up to time t is denoted by n(t). At any point in time, the probability of
failure in interval dt is given by (number of failures)/N. (To be mathematically
correct, we should say that this is only true in the limit as dt → 0.) Similarly,
the reliability can be expressed as R(t) = n(t)/N. The number of failures in
interval dt is given by [n(t) - n(t + dt)], and substitution in Eq. (5.4) yields

$$\frac{n(t) - n(t + dt)}{N} = f(t)\,dt \tag{5.5}$$


However, we can also write Eq. (5.4) as

$$f(t)\,dt = P(\text{no failure in interval } 0 \text{ to } t) \times P(\text{failure in interval } dt \mid \text{no failure in interval } 0 \text{ to } t) \tag{5.6a}$$

The last expression in Eq. (5.6a) is a conditional failure probability, and the
symbol | is interpreted as “given that.” Thus P(failure in interval dt | no failure
in interval 0 to t) is the probability of failure in the interval t to t + dt
given that there was no failure up to t, that is, the item is working at time t.
By definition, this conditional probability is written z(t) dt, where z(t) is
called the hazard function; its more popular name is the failure-rate function.
Since the probability of no failure is just the reliability function,
Eq. (5.6a) can be written as

$$f(t)\,dt = R(t) \times z(t)\,dt \tag{5.6b}$$

This equation relates f(t), R(t), and z(t); however, we will develop a more
convenient relationship shortly.

Substitution of Eq. (5.6b) into Eq. (5.5) along with the relationship R(t) =
n(t)/N yields

$$\frac{n(t) - n(t + dt)}{N} = R(t)\,z(t)\,dt = \frac{n(t)}{N}\,z(t)\,dt \tag{5.7}$$

Solving Eqs. (5.5) and (5.7) for f(t) and z(t), we obtain

$$f(t) = \frac{n(t) - n(t + dt)}{N\,dt} \tag{5.8}$$

$$z(t) = \frac{n(t) - n(t + dt)}{n(t)\,dt} \tag{5.9}$$

Comparing Eqs. (5.8) and (5.9), we see that f(t) reflects the rate of failure
based on the original number N placed on test, whereas z(t) gives the
instantaneous rate of failure based on the number of survivors at the beginning
of the interval.
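
Equations (5.8) and (5.9) can be applied directly to tabulated life-test data.
The following Python sketch is a minimal illustration; the function name
empirical_rates and the survivor counts are hypothetical, and a finite interval
dt stands in for the limit dt → 0, so the results are only estimates.

```python
def empirical_rates(survivors, dt):
    """Estimate f(t) and z(t) from survivor counts taken every dt time units.

    survivors[i] is n(t) at t = i * dt, and survivors[0] is N.
    Returns a list of (t, f, z) tuples, one per interval.
    """
    n_total = survivors[0]
    rows = []
    for i in range(len(survivors) - 1):
        if survivors[i] == 0:
            break  # no survivors remain, so z(t) is undefined from here on
        failed = survivors[i] - survivors[i + 1]
        f = failed / (n_total * dt)       # Eq. (5.8): based on original N
        z = failed / (survivors[i] * dt)  # Eq. (5.9): based on survivors n(t)
        rows.append((i * dt, f, z))
    return rows


# Hypothetical data: 100 items on test, survivors counted every 100 hours.
rates = empirical_rates([100, 90, 81, 73], dt=100.0)
```

For these invented counts, f(t) falls from interval to interval while z(t)
stays close to 0.001 per hour, which shows how the two functions can tell
different stories about the same data.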

We can develop an equation for R(t) in terms of z(t) from Eq. (5.6b):

$$z(t) = \frac{f(t)}{R(t)} \tag{5.10}$$

and from Eq. (5.3), differentiation of both sides yields

$$\frac{dR(t)}{dt} = -f(t) \tag{5.11}$$

Substituting Eq. (5.11) into (5.10) and solving for z(t) yields

$$z(t) = -\frac{1}{R(t)}\,\frac{dR(t)}{dt} \tag{5.12}$$

This differential equation can be solved by integrating both sides, yielding

$$\ln\{R(t)\} = -\int z(t)\,dt \tag{5.13a}$$

Eliminating the natural logarithmic function in this equation by exponentiating
both sides yields

$$R(t) = e^{-\int z(t)\,dt} \tag{5.13b}$$

which is the form of the reliability function that is used in the following model
development.

If one substitutes limits for the integral, a dummy variable, x, is required
inside the integral, and a constant of integration must be added, yielding

$$R(t) = e^{-\int_0^t z(x)\,dx + A} = B\,e^{-\int_0^t z(x)\,dx} \tag{5.13c}$$

As is normally the case in the solution of differential equations, the constant
$B = e^A$ is evaluated from the initial conditions. At t = 0, the item is good and
R(t = 0) = 1. The integral from 0 to 0 is 0; thus B = 1 and Eq. (5.13c) becomes

$$R(t) = e^{-\int_0^t z(x)\,dx} \tag{5.13d}$$
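
When the hazard function has no convenient closed-form integral, Eq. (5.13d)
can be evaluated numerically. The following Python sketch uses the trapezoidal
rule; the function name reliability, the step count, and the two sample hazard
functions are assumptions chosen for illustration.

```python
import math


def reliability(hazard, t, steps=1000):
    """Approximate R(t) = exp(-integral from 0 to t of z(x) dx), Eq. (5.13d),
    using the trapezoidal rule with the given number of steps."""
    if t == 0:
        return 1.0  # initial condition: the item is good at t = 0
    h = t / steps
    area = 0.5 * (hazard(0.0) + hazard(t))
    for i in range(1, steps):
        area += hazard(i * h)
    return math.exp(-area * h)


# Hypothetical hazards: one constant in time, one growing linearly with time.
r_constant = reliability(lambda x: 4e-5, 5_000.0)
r_linear = reliability(lambda x: 1e-6 * x, 1_000.0)
```

Both sample hazards integrate exactly, so they serve as checks: the first
result should be close to exp(-0.2) and the second close to exp(-0.5).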


5.4.4 Mean Time To Failure 

Sometimes, the complete information on failure behavior, z(t) or f(t), is not
needed, and the reliability can be represented by the mean time to failure
(MTTF) rather than the more detailed reliability function. A point estimate
(MTTF) is given instead of the complete time function, R(t). A rough analogy
is to rank the strength of a hitter in baseball in terms of his or her batting
average, rather than the complete statistics of how many times at bat, how many
first-base hits, how many second-base hits, and so on.

The mean value of a probability function is given by the expected value,
E(t), of the random variable, which is given by the integral of the product of
the random variable (time to failure) and its density function, which has the
following form:

$$\text{MTTF} = E(t) = \int_0^\infty t\,f(t)\,dt \tag{5.14}$$

Some mathematical manipulation of Eq. (5.14) involving integration by parts
[Shooman, 1990] yields a simpler expression:

$$\text{MTTF} = E(t) = \int_0^\infty R(t)\,dt \tag{5.15}$$

Sometimes, the mean time to failure is called mean time between failures
(MTBF), and although there is a minor difference in their definitions, we will
use the terms interchangeably.

5.4.5 Constant-Failure Rate 

In general, a choice of the failure-rate function defines the reliability model.
Such a choice should be made based on past studies that include failure-rate
data or reasonable engineering assumptions. In several practical cases, the
failure rate is constant in time, $z(t) = \lambda$, and the mathematics becomes
quite simple. Substitution into Eqs. (5.13d) and (5.15) yields

$$R(t) = e^{-\int_0^t \lambda\,dx} = e^{-\lambda t} \tag{5.16}$$

$$\text{MTTF} = \int_0^\infty e^{-\lambda t}\,dt = \frac{1}{\lambda} \tag{5.17}$$

The result is particularly simple: the reliability function is a decreasing
exponential function where the exponent is the negative of the failure rate
$\lambda$ multiplied by the time. A smaller failure rate means a slower
exponential decay. Similarly, the MTTF is just the reciprocal of the failure
rate, and a small failure rate means a large MTTF.

As an example, suppose that past life tests have shown that an item fails at
a constant-failure rate. If 100 items are tested for 1,000 hours and 4 of these
fail, then $\lambda = 4/(100 \times 1{,}000) = 4 \times 10^{-5}$ failures per
hour. Substitution into Eq. (5.17) yields MTTF = 25,000 hours. Suppose we want
the reliability for 5,000 hours; in that case, substitution into Eq. (5.16)
yields $R(5{,}000) = e^{-(4/100{,}000) \times 5{,}000} = e^{-0.2} = 0.82$. Thus,
if the failure rate is constant at $4 \times 10^{-5}$ per hour, the MTTF is
25,000 hours, and the reliability (probability of no failures) for 5,000 hours
is 0.82.
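
The numbers in this example are easy to reproduce. The short Python sketch
below follows Eqs. (5.16) and (5.17); the variable names are chosen only for
illustration, and, as in the text, the failure rate is estimated as failures
divided by item-hours on test.

```python
import math

failures = 4
items = 100
test_hours = 1_000
mission_hours = 5_000

# Estimated constant failure rate, in failures per item-hour.
failure_rate = failures / (items * test_hours)

# Eq. (5.17): MTTF is the reciprocal of the failure rate (about 25,000 hours).
mttf = 1 / failure_rate

# Eq. (5.16): reliability for the mission time (about 0.82).
r_mission = math.exp(-failure_rate * mission_hours)
```

Changing mission_hours shows how quickly the reliability decays: at
t = MTTF the exponent is -1, so the reliability is only about 0.37.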

More complex failure rates yield more complex results. If the failure rate
increases with time, as is often the case in mechanical components that
eventually “wear out,” the hazard function could be modeled by z(t) = kt. The
reliability and MTTF then become the equations that follow [Shooman, 1990].

$$R(t) = e^{-\int_0^t kx\,dx} = e^{-kt^2/2} \tag{5.18}$$

$$\text{MTTF} = E(t) = \int_0^\infty e^{-kt^2/2}\,dt = \sqrt{\frac{\pi}{2k}} \tag{5.19}$$

Other choices of hazard functions would give other results.
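
For the linearly increasing hazard z(t) = kt, the integral of R(t) in
Eq. (5.15) is a Gaussian integral whose value is the square root of π/(2k).
The Python sketch below checks that closed form by numerical integration; the
function name mttf_linear_hazard, the value of k, the step count, and the
choice of upper limit are all assumptions made for illustration.

```python
import math


def mttf_linear_hazard(k, steps=20_000):
    """Numerically integrate R(t) = exp(-k * t**2 / 2) from 0 upward
    with the trapezoidal rule, per Eq. (5.15)."""
    # Beyond this limit R(t) <= exp(-50), so the neglected tail is negligible.
    upper = 10.0 / math.sqrt(k)
    h = upper / steps
    total = 0.5 * (1.0 + math.exp(-k * upper * upper / 2))
    for i in range(1, steps):
        t = i * h
        total += math.exp(-k * t * t / 2)
    return total * h


k = 1e-6                                # hypothetical hazard slope, per hour squared
numeric = mttf_linear_hazard(k)
closed_form = math.sqrt(math.pi / (2 * k))  # about 1,253 hours
```

The two values agree closely, which gives some confidence in both the closed
form and the numerical routine before the latter is applied to hazard
functions that have no closed form.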

The reliability mathematics of this section applies to hardware failure and 
human errors, and also to software errors if we can characterize the software 
errors by a failure-rate function. The next section discusses how one can
formulate a failure-rate function for software based on a software error model.


5.5 SOFTWARE ERROR MODELS 
5.5.1 Introduction 

Many reliability models discussed in the remainder of this chapter are related
to the number of residual errors in the software; therefore, this section
discusses software error models. Generally, one speaks of faults in the code that
cause errors in the software operation; it is these errors that lead to system
failure. Software engineers differentiate between a fault, a software error, and
a software-caused system failure only when necessary, and the slang
expression “software bug” is commonly used in normal conversation to describe a
software problem.³

Software errors occur at many stages in the software life cycle. Errors may
occur in the requirements-and-specifications phase. For example, the
specifications might state that the time inputs to the system from a precise
cesium atomic clock are in hours, minutes, and seconds when actually the clock
output is in hours and decimal fractions of an hour. Such an erroneous
specification might be found early in the development cycle, especially if a
hardware designer familiar with the cesium clock is included in the
specification review. It is also possible that such an error will not be found
until a system test, when the clock output is connected to the system. Errors
in requirements and specifications are identified as separate entities;
however, they will be added to the code faults in this chapter. If the range
safety officer has to destroy a satellite booster because it is veering off
course, it matters little to him or her whether the problem lies in the
specifications or whether it is a coding error.

Errors occur in the program logic. For example, the THEN and ELSE
clauses in an IF THEN ELSE statement may be interchanged, creating an error,
or a loop is erroneously executed n - 1 times rather than the correct value, which
is n times. When a program is coded, syntax errors are always present and are
caught by the compiler. Such syntax errors are too frequent, embarrassing, and
universal to be considered errors.

Actually, design errors should be recorded once the program management
reviews and endorses a preliminary design expressed by a set of design
representations (H-diagrams, control graphs, and maybe other graphical or
abbreviated high-level control-structure code outlines called pseudocodes) in
addition to requirements and specifications. Often, a formal record of such
changes is not kept. Furthermore, records of errors found by code reading and
testing at the middle (unit) code level (called module errors) are often not
carefully kept. A change in the preliminary design and the occurrence of module
test errors should both be carefully recorded.

³The origin of the word “bug” is very interesting. In the early days of computers, many of the
machines were constructed of vacuum tubes and relays, used punched cards for input, and used
machine language or assembly language. Grace Hopper, one of the pioneers who developed
the language COBOL and who spent most of her career in the U.S. Navy (rising to the rank
of admiral), is generally credited with the expression. One hot day in the summer of 1945 at
Harvard, she was working on the Mark II computer (successor to the pioneering Mark I) when
the machine stopped. Because there was no air conditioning, the windows were opened, which
permitted the entry of a large moth that (subsequent investigation revealed) became stuck between
the contacts of one of the relays, thereby preventing the machine from functioning. Hopper and
the team removed the moth with tweezers; later, it was mounted in a logbook with tape (it is now
displayed in the Naval Museum at the Naval Surface Weapons Center in Dahlgren, Virginia).
The expression “bug in the system” soon became popular, as did the term “debugging” to denote
the fixing of program errors. It is probable that “bug” was used before this incident during World
War II to describe system or hardware problems, but this incident is clearly the origin of the term
“software bug” [Billings, 1989, p. 58].

Oftentimes, the standard practice is not to start counting software errors,
regardless of their cause, until the software comes under configuration
control, generally at the start of integration testing. Configuration control occurs
when a technical manager or management board is put in charge of the official
version of the software and records any changes to the software. Such a change
(error fix) is submitted in writing to the configuration control manager by the
programmer who corrected the error and retested the code of the module with
the design change. The configuration control manager retests the present
version of the software system with the inserted change; if he or she agrees that it
corrects the error and does not seem to cause any problems, the error is added
to the official log of found and corrected errors. The code change is added
to the official version of the program at the next compilation and release of
a new, official version of the software. It is desirable to start recording errors
earlier in the program than in the configuration control stage, but better late
than never! The origin of configuration control was probably a reaction to the
early days of program patching, as explained in the following paragraph.

In the early days of programming, when the compilation of code for a large 
program was a slow, laborious procedure, and configuration control was not 
strongly enforced, programmers inserted their own changes into the compiled 
version of the program. These additions were generally done by inserting a 
machine language GO TO in the code immediately before the beginning of the 
bad section, transferring program flow to an unused memory block. The correct 
code in machine language was inserted into this block, and a GO TO at the end 
of this correction block returned the program flow to an address in the compiled 
code immediately after the old, erroneous code. Thus the error was bypassed; 
such insertions were known as patches. Oftentimes, each programmer had his 
or her own collection of patches, and when a new compilation of software 
was begun, these confusing, sometimes overlapping and chaotic sets of patches 
had to be analyzed, recoded in higher-level language, and officially inserted in 
the code. No doubt configuration control was instituted to do away with this 
terrible practice. 

5.5.2 An Error-Removal Model 

A software error-removal model can be formulated at the beginning of an
integration test (system test). The variable $\tau$ is used to represent the number of
months of development time, and one arbitrarily calls the start of configuration
control $\tau = 0$. At $\tau = 0$, we assume that the software contains $E_T$ total errors.
As testing progresses, $E_c(\tau)$ errors are corrected, and the remaining number of
errors, $E_r(\tau)$, is given by

$$E_r(\tau) = E_T - E_c(\tau) \tag{5.20}$$

If some corrections made to discovered errors are imperfect, or if new errors
are caused by the corrections, we call this error generation. Equation (5.20) is
based on the assumption that there is no error generation, a situation that is
illustrated in Fig. 5.5(a). Note that in the figure a line drawn through any time
$\tau$ parallel to the y-axis is divided into two line segments by the error-removal
curve. The segment below the curve represents the errors that have been
corrected, whereas the segment above the curve extending to $E_T$ represents the
remaining number of errors, and these line segments correspond to the terms in
Eq. (5.20). Suppose the software is released at time $\tau_1$, in which case the figure
shows that not all the errors have been removed, and there is still a small
residual number remaining. If all the coding errors could be removed, there clearly
would be no code-related reasons for software failures (however, there would
still be requirements-and-specifications errors). By the time integration
testing is reached, we assume that the number of requirements-and-specifications
errors is very small and that the number of code errors gradually decreases as
the test process finds more errors to be subsequently corrected.

Figure 5.5 Cumulative errors debugged versus months of debugging. (a) Approaching
equilibrium, horizontal asymptote, no generation of new errors; (b) approaching
equilibrium, generation rate of new errors equal to error-removal rate; and
(c) diverging process, generation rate of new errors exceeding error-removal rate.

5.5.3 Error-Generation Models 

In Fig. 5.5(b), we assume that there is some error generation and that the error
discovery and correction process must be more effective or must take longer
to leave the software with the same number of residual errors at release as in
Fig. 5.5(a). Figure 5.5(c) depicts an extraordinary situation in which the error
removal and correction initially exceeds the error generation; however,
generation does eventually exceed correction, and the residual number of errors
increases. In this case, the most obvious choices are to release at time $\tau_1$ and
suffer poor reliability from the number of residual errors, or else radically
change the test and correction process so that the situation of Fig. 5.5(a) or
(b) ensues and then continue testing. One could also return to an earlier saved
release of the software where error generation was modest, change the test and
correction process, and, starting with this baseline, return to testing. The last
and most unpleasant choice is to discard the software and start again.
(Quantitative error-generation models are given in Shooman [1983, pp. 340-350].)

5.5.4 Error-Removal Models 

Various models can be proposed for the error-correction function, E c (t), given 
in Eq. (5.20). The direct approach is to use the raw data. Error-removal data 
collected over a period of several months can be plotted. Then, an empirical 
curve can be fitted to the data, which can be extrapolated to forecast the future 
error-removal behavior. A better procedure is to propose a model based on 
past observations of error-removal curves and use the actual data to determine 
the model parameters. This blends the past information on the general shape 
of error-removal curves with the early data on the present project, and it also 
makes the forecasting less vulnerable to a few atypical data values at the start 
of the program (the statistical noise). Generally, the procedure takes a smaller
number of observations, and a useful model emerges early in the development
cycle, soon after $\tau = 0$. Of course, the estimate of the model parameters will
have an associated statistical variance that will be larger at the beginning, when
only a few data values are available, and smaller later in the project after more
data is collected. The parameter variance will of course affect the range of the
forecasts. If the project in question is somewhat like the previous projects, the
chosen model will in effect filter out some of the statistical noise and yield
better forecasts. However, what if for some reason the project is quite different
from the previous ones? The “inertia” of the model will temporarily mask these
differences. Also, suppose that in the middle of testing some of the test
personnel or strategies are changed and the error-removal curve is significantly
changed (for better or for worse). Again, the model inertia will temporarily
mask these changes. Thus it is important to plot the actual data and examine it
while one is using the model and making forecasts. There are many statistical
tests to help the observer determine if differences represent statistical noise or
different behavior; however, plotting, inspection, and thinking are all the initial
basic steps.

One must keep in mind that with modern computer facilities, complex
modeling and statistical parameter estimation techniques are easily accomplished;
the difficult part is collecting enough data for accurate, stable estimates of
model parameters and for interpretation of the results. Thus the focus of this
chapter is on understanding and interpretation, not on complexity. In many
cases, the error-removal data is too scant or inaccurate to support a sophisticated
model over a simple one, and the complex model shrouds our understanding.
Consider this example: Suppose we wish to estimate the math skills of 1,000
first-year high-school students by giving them a standardized test. It is too
expensive to test all the students. If we decide to test 10 students, it is unlikely
that the most sophisticated techniques for selecting the sample or processing
the data will give us more than a wide range of estimates. Conversely, if we find
the funds to test 250 students, then any elementary statistical technique should
give us good results. Sophisticated statistical techniques may help us make a
better estimate if we are able to test, say, 50 students; however, the simpler
techniques should still be computed first, since they will be understood by a
wider range of readers.
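
The effect of sample size in this example can be made concrete with a rough calculation. The sketch below is only illustrative: it assumes simple random sampling without replacement from the 1,000 students, a normal approximation (which is crude for a sample of 10), and a hypothetical score standard deviation of 15 points; the function name and these values are our own assumptions and do not come from the example.

```python
import math

def half_width_95(sample_size, population_size=1000, score_std=15.0):
    """Approximate 95% half-width for the estimated mean score.

    Assumes simple random sampling without replacement, so a
    finite-population correction is applied. score_std is a
    hypothetical standard deviation chosen only for illustration.
    """
    fpc = math.sqrt((population_size - sample_size) / (population_size - 1))
    return 1.96 * score_std / math.sqrt(sample_size) * fpc

for n in (10, 50, 250):
    print(f"n = {n:3d}: mean known to within about +/- {half_width_95(n):.1f} points")
```

Under these assumptions the uncertainty shrinks by a factor of nearly six between 10 and 250 students, which is the sense in which a small sample yields only "a wide range of estimates" regardless of how the data is processed.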

Constant Error-Removal Rate. Our development starts with the simplest
models. Assuming that the error-detection rate is constant leads to a
single-parameter error-removal model. In actuality, even if the removal rate were constant,
it would fluctuate from week to week or month to month because of statistical
noise, but there are ample statistical techniques to deal with this. Another
factor that must be considered is the delay of a few days or, occasionally, a few
weeks between the discovery of errors and their correction. For simplicity, we
will assume (as most models do) that such delays do not cause problems.

If one assumes a constant error-correction (removal) rate of $\rho_0$ errors/month
[Shooman, 1972, 1983], Eq. (5.20) becomes

$$E_r(\tau) = E_T - \rho_0\tau \tag{5.21}$$

We can also derive Eq. (5.21) in a more basic fashion by letting the
error-removal rate be given by the derivative of the number of errors remaining.
Thus, differentiation of Eq. (5.20) yields

$$\text{error-correction rate} = \frac{dE_r(\tau)}{d\tau} = -\frac{dE_c(\tau)}{d\tau} \tag{5.22a}$$

Since we assume that the error-correction rate is constant, Eq. (5.22a) becomes

$$\text{error-correction rate} = \frac{dE_r(\tau)}{d\tau} = -\frac{dE_c(\tau)}{d\tau} = -\rho_0 \tag{5.22b}$$

Integration of Eq. (5.22b) yields

$$E_r(\tau) = C - \rho_0\tau \tag{5.22c}$$

The constant $C$ is evaluated from the initial condition at $\tau = 0$, $E_r(\tau) = E_T = C$,
and Eq. (5.22c) becomes

$$E_r(\tau) = E_T - \rho_0\tau \tag{5.22d}$$

which is, of course, identical to Eq. (5.21). The cumulative number of errors
corrected is given by the second term in the equation, $E_c(\tau) = \rho_0\tau$.
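
Because the cumulative number of corrected errors is a straight line through the origin with slope equal to the removal rate, one simple way to estimate that rate from early project data is a least-squares fit of the slope. The following is a minimal sketch under that assumption; the function name is hypothetical, the monthly counts are invented purely for illustration, and other estimators could equally be used.

```python
def estimate_removal_rate(months, cumulative_corrected):
    """Least-squares slope of the line E_c(tau) = rho_0 * tau through the origin."""
    numerator = sum(t * e for t, e in zip(months, cumulative_corrected))
    denominator = sum(t * t for t in months)
    return numerator / denominator

# Hypothetical observations (not from any real project): cumulative errors
# corrected at the end of each of the first four months of integration test.
months = [1, 2, 3, 4]
cumulative_corrected = [14, 31, 44, 61]

rho_0 = estimate_removal_rate(months, cumulative_corrected)
print(f"estimated removal rate: {rho_0:.2f} errors/month")
```

With so few points the estimate carries a large variance, so it is worth plotting the data alongside the fitted line rather than relying on the number alone.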

Although there is some data to support a constant error-removal rate
[Shooman and Bolsky, 1975], most practitioners observe that the error-removal
rate decreases with development time, $\tau$.

Note that in the foregoing discussion we always assumed that the same effort 
is applied to testing and debugging over the interval in question. Either the 
same number of programmers is working on the given phase of development, 
the same number of worker hours is being expended, or the same number and 
difficulty level of tests is being employed. Of course, this will vary from day 
to day; we are really talking about the average over a week or a month. What
would really destroy such an assumption is if two people worked on testing
during the first two weeks in a month and six tested during the last two weeks
of the month. One could always deal with such a situation by substituting for
$\tau$ the number of worker hours, WH; $\rho_0$ would then become the number of
errors removed per worker hour. One would think that WH is always available
from the business records for the project. However, this is sometimes distorted
by the “big project phenomenon”: the manager of big project Z is told by his
or her boss that four programmers who are not working on the project will
charge their salaries to project Z for the next two weeks, because they have no
project support and Z is the only project that has sufficient resources to cover
their salaries. In analyzing data, one should always be alert to the fact that
such anomalies can occur, although the record of WH is generally reliable.

As an example of how a constant error-removal rate can be used, consider a
10,000-line program that enters the integration test phase. For discussion
purposes, assume we are omniscient and know that there are 130 errors. Suppose
that the error removal proceeds at the rate of 15 per month and that the
error-removal curve will be as shown in Fig. 5.6. Suppose that the schedule calls for
release of the software after 8 months. There will be 130 - 120 = 10 errors
left after 8 months of testing and debugging, but of course this information
is unknown to the test team and managers. The error-removal rate in Fig. 5.6
remains constant up to 8 months, then drops to 0 when testing and debugging
is stopped. (Actually, there will be another phase of error correction when the
software is released to the field and the field errors are corrected; however, this
is ignored here.) The number of errors remaining is represented by the vertical
line between the cumulative errors removed and the number of errors at the
start.

Figure 5.6 Illustration of a constant error-removal rate (rate axis in errors/month).

How significant are the 10 residual errors? It depends on how often they
occur during operation and how they affect the program operation. A complete
discussion of these matters will have to wait until we develop the software
reliability models in subsequent sections. One observation that makes us a little
uneasy about this constant error-removal model is that the cumulative
error-removal curve given in Fig. 5.6 is linearly increasing and does not give us an
indication that most of the residual errors have been removed. In fact, if one
tested for about an additional two-thirds of a month, another 10 errors would be
found and removed, and all the errors would be gone. Philosophically, removal
of all errors is hard to believe; practical experience shows that this is rare, if
at all possible. Thus we must look for a more realistic error-removal model.
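
The arithmetic of the constant-rate example is easy to check. The sketch below simply evaluates Eq. (5.21) with the numbers given above; the function name is our own, and the floor at zero is an added assumption so that the model never reports a negative number of errors.

```python
def errors_remaining_constant(tau, total_errors, rho_0):
    """Eq. (5.21): E_r(tau) = E_T - rho_0 * tau, floored at zero (our assumption)."""
    return max(total_errors - rho_0 * tau, 0.0)

E_T = 130     # total errors at tau = 0 (from the example)
RHO_0 = 15    # errors removed per month (from the example)

remaining_at_release = errors_remaining_constant(8, E_T, RHO_0)
months_to_zero = E_T / RHO_0          # time at which the model predicts no errors left
extra_months = months_to_zero - 8     # additional testing beyond the 8-month release

print(f"errors remaining after 8 months: {remaining_at_release:.0f}")
print(f"extra testing needed to reach zero: {extra_months:.2f} months")
```

The second figure is the "two-thirds of a month" mentioned above, and it shows why the constant-rate model is suspect near the end of testing: it predicts a perfectly clean program after a short, finite extension.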


Linearly Decreasing Error-Removal Rate. Most practitioners have observed
that the error-removal rate decreases with development time, $\tau$. Thus the next
error-removal model we introduce is one that decreases with development time,
and the simplest choice for a decreasing model is a linear decrease. If we
assume that the error-removal rate decreases linearly as a function of time,
$\tau$ [Musa, 1975, 1987], then instead of Eq. (5.22a) we have

$$\frac{dE_r(\tau)}{d\tau} = -(K_1 - K_2\tau) \tag{5.23a}$$

which represents a linearly decreasing error-removal rate. At some time $\tau_0$, the
linearly decreasing error-removal rate should go to 0, and substitution into
Eq. (5.23a) yields $K_2 = K_1/\tau_0$. Substitution into Eq. (5.23a) yields

$$\frac{dE_r(\tau)}{d\tau} = -K\left(1 - \frac{\tau}{\tau_0}\right) \tag{5.23b}$$

which clearly shows the linear decrease. For convenience, the subscript on $K$
was dropped since it was no longer needed. Integration of Eq. (5.23b) yields

$$E_r(\tau) = C - K\tau\left(1 - \frac{\tau}{2\tau_0}\right) \tag{5.23c}$$

The constant $C$ is evaluated from the initial condition at $\tau = 0$, $E_r(\tau) = E_T = C$,
and Eq. (5.23c) becomes

$$E_r(\tau) = E_T - K\tau\left(1 - \frac{\tau}{2\tau_0}\right) \tag{5.23d}$$

Inspection of Eq. (5.23b) shows that $K$ is determined by the initial
error-removal rate at $\tau = 0$.

We now repeat the example introduced above to illustrate a linearly
decreasing error-removal rate. Since we wish the removal of 120 errors after 8 months
to compare with the previous example, we set $E_T = 130$ and $\tau_0 = 8$, and at
$\tau = 8$, $E_r(\tau = 8)$ is equal to 10. Solving Eq. (5.23d) for $K$, we obtain a value
of 30, and the equations for the error-correction rate and number of remaining
errors become

$$\frac{dE_r(\tau)}{d\tau} = -30\left(1 - \frac{\tau}{8}\right) \tag{5.24a}$$

$$E_r(\tau) = 130 - 30\tau\left(1 - \frac{\tau}{16}\right) \tag{5.24b}$$

The error-removal curve will be as shown in Fig. 5.7; the error-removal rate
decreases to 0 at 8 months. Suppose that the schedule calls for release of the
software after 8 months. There will be 130 - 120 = 10 errors left after 8 months
of testing and debugging, but of course this information is unknown to the test
team and managers. The error-removal rate in Fig. 5.7 drops to 0 when testing and
debugging is stopped. The number of errors remaining is represented by the
vertical line between the cumulative errors removed and the number of errors
at the start. These results give an error-removal curve that seems to become
asymptotic as we approach 8 months of testing and debugging. Of course, the
decrease of the error-removal rate to 0 at 8 months was chosen to match the previous
constant error-removal example. In practice, however, the numerical values of
parameters $K$ and $\tau_0$ would be chosen to match experimental data taken during
the early part of the testing. The linear decrease of the error rate still seems
somewhat artificial, and a final model with an exponentially decreasing error
rate will now be developed.

Figure 5.7 Illustration of a linearly decreasing error-removal rate.
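
The value of K in this example follows directly from the closed-form expression for the remaining errors, Eq. (5.23d). The sketch below solves for K and evaluates the rate and the residual errors at the two ends of the test period; the function names are our own, and the choice of 8 months for the time at which the rate reaches zero is taken from the example.

```python
def removal_rate_linear(tau, k, tau_0):
    """Magnitude of Eq. (5.23b): K * (1 - tau / tau_0), in errors per month."""
    return k * (1 - tau / tau_0)

def errors_remaining_linear(tau, total_errors, k, tau_0):
    """Eq. (5.23d): E_r = E_T - K * tau * (1 - tau / (2 * tau_0)), for 0 <= tau <= tau_0."""
    return total_errors - k * tau * (1 - tau / (2 * tau_0))

E_T, TAU_0 = 130, 8

# Require 10 errors remaining at tau = tau_0 and solve Eq. (5.23d) for K.
K = (E_T - 10) / (TAU_0 * (1 - TAU_0 / (2 * TAU_0)))

print(f"K = {K:.0f} errors/month")
print(f"rate at start: {removal_rate_linear(0, K, TAU_0):.0f}, "
      f"rate at 8 months: {removal_rate_linear(8, K, TAU_0):.0f}")
print(f"errors remaining at 8 months: {errors_remaining_linear(8, E_T, K, TAU_0):.0f}")
```

Note that the initial rate is twice the constant rate of the earlier example, which is what a linearly falling rate needs in order to remove the same 120 errors in the same 8 months.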


Exponentially Decreasing Error-Removal Rate. The notion of an
exponentially decreasing error rate is attractive since it predicts a harder time in finding
errors as the program is perfected. Programmers often say they observe such
behavior as a program nears release. In fact, one can derive such an
exponential curve based on simple assumptions. Assume that the number of errors
corrected, $E_c(\tau)$, is exactly equal to the number of errors detected, $E_d(\tau)$, and
that the rate of error detection is proportional to the number of remaining errors
[Shooman, 1983, pp. 332-335]:

$$\frac{dE_d(\tau)}{d\tau} = \alpha E_r(\tau) \tag{5.25a}$$

Substituting for $E_r(\tau)$ from Eq. (5.20) and letting $E_d(\tau) = E_c(\tau)$ yields

$$\frac{dE_c(\tau)}{d\tau} = \alpha[E_T - E_c(\tau)] \tag{5.25b}$$

Rearranging the differential equation given in Eq. (5.25b) yields

$$\frac{dE_c(\tau)}{d\tau} + \alpha E_c(\tau) = \alpha E_T \tag{5.25c}$$

To solve this differential equation, we obtain the homogeneous solution by
setting the right-hand side equal to 0 and substituting the trial solution
$E_c(\tau) = Ae^{a\tau}$ into Eq. (5.25c). The only solution is when $a = -\alpha$. Since the
right-hand side of the equation is a constant, the particular solution is a constant.
Adding the homogeneous and particular solutions yields

$$E_c(\tau) = Ae^{-\alpha\tau} + B \tag{5.25d}$$

We can determine the constants $A$ and $B$ from initial conditions or by
substitution back into Eq. (5.25c). Substituting the initial condition into Eq. (5.25d)
when $\tau = 0$, $E_c = 0$ yields $A + B = 0$ or $A = -B$. Similarly, when $\tau \to \infty$,
$E_c \to E_T$, and substitution yields $B = E_T$. Thus Eq. (5.25d) becomes

$$E_c(\tau) = E_T(1 - e^{-\alpha\tau}) \tag{5.25e}$$

Substitution of Eq. (5.25e) into Eq. (5.20) yields

$$E_r(\tau) = E_T e^{-\alpha\tau} \tag{5.25f}$$

We continue with the example introduced above to illustrate an exponentially
decreasing error-removal rate starting with $E_T = 130$ at $\tau = 0$. To match the
previous results, we assume that $E_r(\tau = 8)$ is equal to 10, and substitution into
Eq. (5.25f) gives $10 = 130e^{-8\alpha}$. Solving for $\alpha$ by taking natural logarithms of
both sides yields the value $\alpha = 0.3206$. Substitution of these values leads to
the following equations:

$$\frac{dE_r(\tau)}{d\tau} = -\alpha E_T e^{-\alpha\tau} = -41.68e^{-0.3206\tau} \tag{5.26a}$$

$$E_r(\tau) = 130e^{-0.3206\tau} \tag{5.26b}$$

The error-removal curve is shown in Fig. 5.8. The rate starts at 41.68 at
$\tau = 0$ and decreases to 3.21 at $\tau = 8$. Theoretically, the error-removal rate continues
to decrease exponentially and only reaches 0 at infinity. We assume, however,
that testing stops after $\tau = 8$ and the removal rate falls to 0. The error-removal
curve climbs a little more steeply than that shown in Fig. 5.7, but they both
reach 120 errors removed after 8 months and stay constant thereafter.
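
The numerical values in the exponential example can be reproduced in a few lines. This is only a sketch that evaluates the closed-form expressions for the remaining errors and the removal rate; the function names are our own.

```python
import math

E_T = 130
# Solve 10 = 130 * exp(-8 * alpha) for alpha by taking natural logarithms.
alpha = math.log(E_T / 10) / 8

def errors_remaining_exp(tau):
    """Eq. (5.25f): E_r(tau) = E_T * exp(-alpha * tau)."""
    return E_T * math.exp(-alpha * tau)

def removal_rate_exp(tau):
    """Magnitude of Eq. (5.26a): alpha * E_T * exp(-alpha * tau), in errors per month."""
    return alpha * E_T * math.exp(-alpha * tau)

print(f"alpha = {alpha:.4f} per month")
print(f"rate at start: {removal_rate_exp(0):.2f} errors/month")
print(f"rate at 8 months: {removal_rate_exp(8):.2f} errors/month")
print(f"errors remaining at 8 months: {errors_remaining_exp(8):.1f}")
```

Unlike the constant and linear models, the removal rate here never reaches zero, so extending the test always removes a few more errors but never all of them.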

Figure 5.8 Illustration of an exponentially decreasing error-removal rate.

Other Error-Removal-Rate Models. Clearly, one could continue to evolve
many other error-removal-rate models, and even though the ones discussed
in this section should suffice for most purposes, we should mention a few
other approaches in closing. All of these models assume a constant number
of worker hours expended throughout the integration test and error-removal
phase. On many projects, however, the process starts with a few testers, builds
to a peak, and then uses fewer personnel as the release of the software nears.
In such a case, an S-shaped error-removal curve ensues. Initially, the shape is
concave upward until the main force is at work, at which time it is
approximately linear; then, toward the end of the curve, it becomes concave downward.
One way to model such a curve is to use piecewise methods. Continuing with
our error-removal example, suppose that the error-removal rate starts at 2 per
month at $\tau = 0$ and increases to 5.4 and 14.77 after 1 and 2 months,
respectively. Between 2 and 6 months it stays constant at 15 per month; in months
7 and 8, it drops to 5.52 and 2 per month. The resultant curve is given in Fig.
5.9. Since fewer people are used during the first 2 and last 2 months, fewer
errors are removed (about 90 for the numerical values used for the purpose of
illustration). Clearly, to match the other error-removal models, a larger number
of personnel would be needed in months 3-6.
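
One rough way to check the total for the piecewise example is to integrate the quoted rates numerically. The sketch below assumes straight-line interpolation between the quoted monthly values (the text does not say how the rate varies between them) and treats the step from 14.77 to 15 at 2 months as a jump; the function name is our own.

```python
def cumulative_removed(times, rates):
    """Trapezoidal integration of a piecewise-linear error-removal rate."""
    total = 0.0
    for i in range(1, len(times)):
        total += 0.5 * (rates[i] + rates[i - 1]) * (times[i] - times[i - 1])
    return total

# (month, errors/month) values quoted in the example. The repeated time 2
# encodes the jump from 14.77 to the constant 15 per month.
times = [0, 1, 2, 2, 6, 7, 8]
rates = [2, 5.4, 14.77, 15, 15, 5.52, 2]

total_removed = cumulative_removed(times, rates)
print(f"errors removed in 8 months: about {total_removed:.0f}")
```

Under this interpolation the total comes to a little under 90, in line with the rough figure quoted above and well short of the 120 errors removed by the other three models.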

The next section relates the reliability of the software to the
error-removal-rate models that were introduced in this section.

Figure 5.9 Illustration of an S-shaped error-removal rate. (Curves shown:
cumulative errors removed, errors at start, and error-removal rate in errors/month.)

5.6 RELIABILITY MODELS
5.6.1 Introduction

In the preceding sections, we established the mathematical basis of the
reliability function and related it to the failure-rate function. Also, a number of
error-removal models were developed. Both of these efforts were preludes to
formulating a software reliability model. Before we become absorbed in the
details of reliability model development, we should review the purpose of
software reliability models.

Software reliability models are used to answer two main questions during
product development: When should we stop testing? and Will the product
function well and be considered reliable? Both are technical management questions;
the former can be restated as follows: When are there few enough errors so
that the software can be released to the field (or at least to the last stage of
testing)? To continue testing is costly, but to release a product with too many
errors is more costly. The errors must be fixed in the field at high cost, and
the product develops a reputation for unreliability that will hurt its acceptance.
The software reliability models to be developed quantify the number of errors
remaining and especially provide a prediction of the field reliability, helping
technical and business management reach a decision regarding when to release
the product. The contract or marketing plan contains a release date, and
penalties may be assessed by a contract for late delivery. However, we wish to avoid
the dilemma of the on-time release of a product that is too “buggy” and thus
defective.

The other job of software reliability models is to give a prediction of field
reliability as early as possible. Too many software products are released and,
although they operate, errors occur too frequently; in retrospect, the projects
become failures because people do not trust the results or tire of dealing with
frequent system crashes. Most software products now have competitors, so
an unreliable product loses out or must be fixed up after release
at great cost. Many software systems are developed for a single user for a
special purpose, for example, air traffic control, IRS tax programs, social services’
record systems, and control systems for radiation-treatment devices. Failures
of such systems can have dire consequences and huge impact. Thus, given
requirements and a quality goal, the types of reliability models we seek are
those that are easy to understand and use and also give reasonable results. The
relative accuracy of two models in which one predicts one crash per week and
another predicts two crashes per week may seem vastly different in a
mathematical sense. However, suppose a good product should have less than one
crash a month or, preferably, a few crashes per year. In this case, both
models tell the same story: the software is not nearly good enough! Furthermore,
suppose that these predictions are made early in the testing when only a little
failure data is available and the variance produces a range of estimates that
vary by more than two to one. The real challenge is to get practitioners to
collect data, use simple models, and make predictions to guide the program.
One can always apply more sophisticated models to the same data set once the
basic ideas are understood. The biggest mistake is to avoid making a reliability
estimate because (a) it does not work, (b) it is too costly, or (c) we do not
have the data. None of these reasons is correct or valid, and accepting them
represents poor management. The next biggest mistake is to make a model, obtain poor
reliability predictions, and ignore them because they are too depressing.

5.6.2 Reliability Model for Constant Error-Removal Rate 

The basic simplicity and some of the drawbacks of the simple constant
error-removal model were discussed in the previous section on error-removal
models. Even with these limitations, it is the simplest place to start: we
develop most of the features of software reliability models using this model
before we progress to more complex ones [Shooman, 1972].

The major assumption needed to relate an error-removal model to a software
reliability model is how the failure rate is related to the remaining number of
errors. For the remainder of this chapter, we assume that the failure rate is
proportional to the remaining number of errors:

$$z(t) = kE_r(\tau) \tag{5.27}$$


The bases of this assumption are as follows: 

1. It seems reasonable to assume that more residual errors in the software 
result in higher software failure rates. 

2. Musa [1987] has experimental data supporting this assumption. 

3. If the rate of error discovery is a random process dependent on input and 
initial conditions, then the discovery rate is proportional to the number 
of residual errors. 

If one combines Eq. (5.27) with one of the software error-removal models of
the previous section, then a software reliability model is defined. Substitution
of the failure rate into Eqs. (5.13d) and (5.15) yields a reliability model R(t)
and an expression for the MTTF.


As an example, we begin with the constant error-removal model, Eq.
(5.22d),

$$E_r(\tau) = E_T - \rho_0\tau \tag{5.28a}$$

Using the assumption of Eq. (5.27), one obtains

$$z(t) = kE_r(\tau) = k(E_T - \rho_0\tau) \tag{5.29}$$

and the reliability and MTTF expressions become

$$R(t) = e^{-\int_0^t k(E_T - \rho_0\tau)\,dx} = e^{-k(E_T - \rho_0\tau)t} \tag{5.30a}$$

$$\text{MTTF} = \frac{1}{k(E_T - \rho_0\tau)} \tag{5.30b}$$

Figure 5.10 Variation of reliability function R(t) with operating time t for fixed
values of debugging time $\tau$. Note the time axis, t, is normalized.


The two preceding equations mathematically define the constant
error-removal rate software reliability model; however, there is still much to be said
in an engineering sense about how we apply this model. We must have a
procedure for estimating the model parameters, $E_T$, $k$, and $\rho_0$, and we must interpret
the results. For discussion purposes, we will reverse the order: we assume that
the parameters are known and discuss the reliability and MTTF functions first.
Since the parameters are assumed to be known, the exponent in Eq. (5.30a) is
just a function of $\tau$; for convenience, we can define $k(E_T - \rho_0\tau) = \gamma(\tau)$. Thus,
as $\tau$ increases, $\gamma$ decreases. Equation (5.30a) therefore becomes

$$R(t) = e^{-\gamma t} \tag{5.31}$$

Equation (5.31) is plotted in Fig. 5.10 in terms of the normalized time scale
$\gamma t$.
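
The constant error-removal reliability model is simple enough to evaluate directly. The sketch below codes the failure rate, the reliability function, and the MTTF for a given amount of debugging time; the function names are our own, the error counts come from the earlier example, and the proportionality constant is a purely hypothetical value chosen for illustration (in practice it must be estimated from failure data).

```python
import math

def failure_rate(tau, total_errors, rho_0, k):
    """Eq. (5.29): z = k * (E_T - rho_0 * tau); constant in operating time t."""
    return k * (total_errors - rho_0 * tau)

def reliability(t, tau, total_errors, rho_0, k):
    """Eq. (5.30a): R(t) = exp(-z * t) after tau months of debugging."""
    return math.exp(-failure_rate(tau, total_errors, rho_0, k) * t)

def mttf(tau, total_errors, rho_0, k):
    """Eq. (5.30b): MTTF = 1 / z (meaningful only while errors remain)."""
    return 1.0 / failure_rate(tau, total_errors, rho_0, k)

E_T, RHO_0 = 130, 15   # values from the error-removal example
K_ASSUMED = 0.001      # hypothetical: failures per hour per residual error

for tau in (4, 6, 8):
    print(f"tau = {tau} months: "
          f"MTTF = {mttf(tau, E_T, RHO_0, K_ASSUMED):.1f} h, "
          f"R(10 h) = {reliability(10, tau, E_T, RHO_0, K_ASSUMED):.3f}")
```

Evaluating the functions for several debugging times gives one curve per value, which is the family of exponentials sketched in Fig. 5.10.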

Let us assume that the project receives a minimum amount of testing and
debugging during $\tau_0$ months. There would still be quite a few errors left, and
the reliability would be mediocre. In fact, Fig. 5.10 shows (see vertical dotted
line) that when $t = 1/\gamma$, the reliability is about 0.37 (since $e^{-1} \approx 0.368$),
meaning that there is a 63% chance that a failure occurs in the interval
$0 < t < 1/\gamma$ and a 37% chance that no failure occurs in this interval. This is
rather poor and would not be satisfactory in any normal project. If predicted
early in the integration test process, changes would be made. One can envision
more vigorous testing that would increase the parameter $\rho_0$ and remove errors
faster or, as we will discuss now, just test longer. Assume that the integration
test period is lengthened to $\tau_1 > \tau_0$ months. More errors will be removed,
$\gamma$ will be smaller, and the exponential curve will decrease more slowly as shown
by the middle curve in the figure. There would be a 50% chance that a failure
occurs in the same interval $0 < t < 1/\gamma$ and a 50% chance that no failure
occurs in this interval: better, but still not good enough. Suppose the test is
lengthened further to $\tau_2 > \tau_1$ months, yielding a success probability of 75%.
This might be satisfactory in some projects but would still not be good enough
for really high reliability projects, so one should explore major changes. A
different error-removal model would yield a different reliability function,
predicting either higher or lower reliability, but the overall interpretation of the
curves would be substantially the same. The important point is that one would
be able to predict (as early as possible in testing) an operational reliability and
compare this with the project specifications or observed reliabilities for existing
software that serves a similar function.

Figure 5.11 Plot of MTTF versus debugging time $\tau$, given by Eq. (5.32). Note the
time axis, $\tau$, and the MTTF axis are both normalized.

Similar results, but from a slightly different viewpoint, are obtained by
studying the MTTF function. Normalization will again be used to simplify the
plotting of the MTTF function. Note how $\alpha$ and $\beta$ are defined in
Eq. (5.32) and that $\alpha\tau = 1$ represents the point where all the errors
have been removed and the MTTF approaches infinity. Note that the MTTF function
initially increases almost linearly and slowly as shown in Fig. 5.11. Later,
when the number of errors remaining is small, the function increases rapidly.
The behavior of the MTTF function is the same as the function $1/x$ as
$x \to 0$. The importance of this effect is that the majority of the
improvement comes at the end of the testing cycle; thus, without a model, a
manager may say that based on data before the "knee" of the curve, there is
only slow progress in improving the MTTF, so why not release the software and
fix additional bugs in the field? Given this model, one can see that with a
little more effort, rapid progress is expected once the knee of the curve is
passed, and a little more testing should yield substantial improvement. The
fact that the MTTF approaches infinity as the number of errors approaches 0 is
somewhat disturbing, but this will be remedied when other error-removal models
are introduced.


$$\text{MTTF} = \frac{1}{k(E_T - \rho_0\tau)} = \frac{1}{kE_T(1 - \rho_0\tau/E_T)} = \frac{1}{\beta(1 - \alpha\tau)} \tag{5.32}$$
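The "knee" described above can be seen numerically. The following is a minimal sketch, assuming the normalization implied by Eq. (5.32), namely $\beta = kE_T$ and $\alpha = \rho_0/E_T$; the function name `normalized_mttf` is a hypothetical helper introduced only for illustration.

```python
def normalized_mttf(alpha_tau: float) -> float:
    """Return beta * MTTF = 1 / (1 - alpha * tau), the normalized form of Eq. (5.32)."""
    if not 0.0 <= alpha_tau < 1.0:
        # At alpha * tau = 1 every error is removed and the model's MTTF is unbounded.
        raise ValueError("alpha * tau must lie in [0, 1)")
    return 1.0 / (1.0 - alpha_tau)


if __name__ == "__main__":
    # Tabulate the normalized MTTF against normalized debugging time.
    for x in (0.0, 0.25, 0.5, 0.75, 0.9, 0.95, 0.99):
        print(f"alpha*tau = {x:4.2f}   beta*MTTF = {normalized_mttf(x):7.2f}")
```

Under these assumptions, the first half of the normalized debugging time only doubles the MTTF, whereas moving from 0.9 to 0.99 multiplies it by ten.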


One can better appreciate this model if we use the numerical data from the
example plotted in Fig. 5.6. The parameters $E_T$ and $\rho_0$ given in the
example are 130 and 15, but the parameter $k$ must still be determined. Suppose
that $k = 0.000132$, in which case Eq. (5.30a) becomes

$$R(t) = e^{-0.000132(130 - 15\tau)t} \tag{5.33}$$

At $\tau = 8$, the equation becomes

$$R(t) = e^{-0.00132t} \tag{5.34a}$$

The preceding is plotted as the middle curve in Fig. 5.12. Suppose that
the software operates for 300 hours; then the reliability function predicts that
there is a 67% chance of no software failures in the interval $0 < t < 300$. If
we assume that these software reliability estimates are being made early in the
testing process (say, after 2 months), one could predict the effects, good and
bad, of debugging for more or less than $\tau = 8$ months. (Again, we ask the
reader to be patient about where all these values for $E_T$, $\rho_0$, and $k$
are coming from. They would be derived from data collected on the program
during the first 2 months of testing. The discussion of the parameter
estimation process has purposely been separated from the interpretation of the
models to avoid confusion.)

Frequently, management wants the technical staff to consider shortening the 
test period, since doing so would save project-development money and help 
keep the project on time. We can use the software reliability model to illustrate 
the effect (often disastrous) of such a change. If testing and debugging are 
shortened to only 6 months, Eq. (5.33) would become 

$$R(t) = e^{-0.00528t} \tag{5.34b}$$

Equation (5.34b) is plotted as the lower curve in Fig. 5.12. At 300 hours,
there is only a 20.5% chance of no failures, which is clearly unacceptable. One
might also show management the beneficial effects of slightly longer testing
and debugging time. If we debugged for 8.5 months, then Eq. (5.33) would
become

$$R(t) = e^{-0.00033t} \tag{5.34c}$$

Equation (5.34c) is plotted as the upper curve in Fig. 5.12, and the reliability
at 300 hours is 90.6%, a very significant improvement. Thus the technical
people on the project should lobby for a slightly longer integration test period.

Figure 5.12 Reliability functions for constant error-removal rate and 6, 8, and 8.5
months of debugging. See Eqs. (5.34a-c).
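The three what-if cases above (6, 8, and 8.5 months of debugging) can be checked with a few lines of code. This is a minimal sketch, assuming the example values $E_T = 130$, $\rho_0 = 15$, and $k = 0.000132$ from the text; the function name `reliability` is a hypothetical helper.

```python
import math

K = 0.000132   # failure-rate proportionality constant k from the running example
E_T = 130.0    # total number of errors at the start of integration testing
RHO_0 = 15.0   # constant error-removal rate, errors per month


def reliability(t_hours: float, tau_months: float) -> float:
    """Constant error-removal-rate model: R(t) = exp(-k (E_T - rho_0 tau) t)."""
    remaining = E_T - RHO_0 * tau_months
    if remaining < 0.0:
        # The model has no meaning once it predicts a negative error count.
        raise ValueError("tau exceeds E_T / rho_0; the model breaks down")
    return math.exp(-K * remaining * t_hours)


if __name__ == "__main__":
    for tau in (6.0, 8.0, 8.5):
        print(f"tau = {tau:3.1f} months   R(300 h) = {reliability(300.0, tau):.3f}")
```

The sketch reproduces the 20.5%, 67%, and 90.6% figures quoted for 300 hours of operation.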

The overall interpretation of Fig. 5.12 leads to sensible conclusions;
however, the constant error-removal model breaks down when $\tau$ is allowed to
approach 8.67 months of testing. We see that Eq. (5.33) predicts that all the
errors have been removed and that the reliability becomes unity. This effect
becomes even clearer when we examine the MTTF function, and it is a good
reason to progress shortly to the reliability models related to both the linearly
decreasing and exponentially decreasing error-removal models.

The MTTF function is given by Eq. (5.32), and substituting the numerical
values $E_T = 130$, $\rho_0 = 15$, and $k = 0.000132$ (corresponding to 8
months of debugging) yields

$$\text{MTTF} = \frac{1}{k(E_T - \rho_0\tau)} = \frac{1}{0.000132(130 - 15\tau)} = \frac{7575.76}{130 - 15\tau} \tag{5.35}$$


The MTTF function given in Eq. (5.35) is plotted in Fig. 5.13 and listed in 
Table 5.2. The dramatic differences in the MTTF predicted by this model as 
the number of remaining errors rapidly approaches 0 seem difficult to believe 
and represent another reason to question constant error-removal-rate models. 


Figure 5.13 MTTF function (MTTF versus months of debugging) for a constant
error-removal-rate model. See Eq. (5.35).


5.6.3 Reliability Model for a Linearly Decreasing Error-Removal Rate

We now develop a reliability model for the linearly decreasing error-removal
rate as we did with the constant error-removal-rate model. The linearly
decreasing error-removal-rate model is given by Eq. (5.23d). Continuing with the
example in use, we let $E_T = 130$, $K = 30$, and $\tau_0 = 8$, which led to
Eq. (5.24b), and substitution yields the failure-rate function Eq. (5.29):

$$z(\tau) = kE_r(\tau) = k[130 - 30\tau(1 - \tau/16)] \tag{5.36}$$

and also yields the reliability function:

$$R(t) = e^{-k[130 - 30\tau(1 - \tau/16)]t} \tag{5.37a}$$
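The numerical reliability functions that the later discussion cites as Eqs. (5.37c, d) are not reproduced at this point in the text. As a hedged reconstruction only, assuming the same $k = 0.000132$ as in the constant-rate example and the failure rate of Eq. (5.36), the reliability $R(t) = e^{-z(\tau)t}$ for 8 and 6 months of debugging would work out as follows:

```latex
\begin{align*}
\tau = 8:\quad E_r(8) &= 130 - 30(8)\left(1 - \tfrac{8}{16}\right) = 10,
  & R(t) &= e^{-0.000132(10)t} = e^{-0.00132t} \\
\tau = 6:\quad E_r(6) &= 130 - 30(6)\left(1 - \tfrac{6}{16}\right) = 17.5,
  & R(t) &= e^{-0.000132(17.5)t} = e^{-0.00231t}
\end{align*}
```

At $t = 300$ hours these expressions give about 0.67 and 0.50, which agrees with the values quoted for the linearly decreasing model elsewhere in this section.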


TABLE 5.2 MTTF for Constant Error-Removal Model

Total months of debugging: 8
Formula for MTTF: 7,575.76/(130 - 15$\tau$)

Elapsed months of
debugging, $\tau$       MTTF (hours)
      0                     58.28
      2                     75.76
      4                    108.23
      6                    189.39
      8                    757.58
      8.5                3,030.30

Figure 5.14 Reliability functions for the linearly decreasing error-removal-rate model
and 6 and 8 months of debugging. See Eqs. (5.37c, d).



Note that since we have chosen a linearly decreasing error model that goes
to 0 at $\tau = 8$ months, there is no additional error removal between 8 and
8.5 months. (Again, this may seem a little strange, but this effect will
disappear when we consider the exponentially decreasing error-rate model in the
next section.) The reliability functions given in Eqs. (5.37c, d) are plotted
in Fig. 5.14. Note that the reliability curve for 8 months of debugging is
identical to the curve for the constant error-removal model given in Fig. 5.12.
This occurs because we have purposely chosen the linearly decreasing error
model to have the same area (cumulative errors removed) over 8 months as the
constant error-removal-rate model (the area of the triangle is the same as the
area of the rectangle). In the case of 6 months of debugging, the reliability
function associated with the linearly decreasing error-removal model is better
than that of the constant error-removal model. This is because the linearly
decreasing model starts out at a higher removal rate and decreases; thus, over
6 months of debugging we take advantage of the higher error-removal rates at
the beginning, whereas over 8 months the lower error-removal rates at the end
balance the larger error-removal rates at the beginning. We will now develop
the MTTF function for the linear error-removal case.

TABLE 5.3 MTTF for Linearly Decreasing Error-Removal Model

Total months of debugging: 8
Formula for MTTF: 7,575.76/[130 - 30$\tau$(1 - $\tau$/16)]

Elapsed months of
debugging, $\tau$       MTTF (hours)
      0                     58.28
      2                     97.75
      4                    189.39
      6                    432.90
      8                    757.58

The MTTF function is derived by substitution of Eq. (5.37a) into Eq. (5.15).
Note that the integration in Eq. (5.15) is done with respect to $t$, and the
function $z$ in Eq. (5.36), which multiplies $t$ in the exponent of Eq. (5.37a),
is a function of $\tau$ (not $t$), so it is a constant in the integration used
to determine MTTF. The result is

$$\text{MTTF} = \frac{1}{k[130 - 30\tau(1 - \tau/16)]} \tag{5.38a}$$

We substitute the value chosen for $k$, $k = 0.000132$, and $\tau_0 = 8$ into
Eq. (5.38a), yielding

$$\text{MTTF} = \frac{7575.76}{130 - 30\tau(1 - \tau/16)} \tag{5.38b}$$


The results of Eq. (5.38b) are given in Table 5.3 and Fig. 5.15. By comparing
Figs. 5.13 and 5.15 or, better, Tables 5.2 and 5.3, one observes that
because of the way in which the constants were picked, the MTTF curves for
the linearly decreasing error-removal and the constant error-removal models
agree when $\tau = 0$ and 8. For intermediate values of $\tau = 2$, 4, 6, and so
on, the MTTF for the linearly decreasing error-removal model is higher because
of the initially higher error-removal rate. Since the linearly decreasing
error-removal model was chosen to go to 0 at $\tau = 8$, the values of MTTF for
$\tau > 8$ really stay at 757.58. The model presented in the next section will
remedy this counterintuitive result.

Figure 5.15 MTTF function for a linearly decreasing error-removal-rate model. See
Eq. (5.38b).


5.6.4 Reliability Model for an Exponentially Decreasing 
Error-Removal Rate 

An exponentially decreasing error-removal-rate model was introduced in
Section 5.5.4, and the general shape of this function removed some of the
anomalies of the constant and the linearly decreasing models. Also, it was shown
in Eqs. (5.25a-e) that this exponential model was the result of assuming that
error detection was proportional to the number of errors present. In addition,
many practitioners as well as theoretical modelers have observed that
the error-removal rate decreases at a declining rate as testing increases (i.e.,
as $\tau$ increases), which fits in with the hypothesis (one that is not too
difficult to conceive) that early errors removed in a computer program are
uncovered by tests. Later errors are more subtle and more "deeply embedded,"
requiring more time and effort to formulate tests to uncover them. An
exponential error-removal model has been proposed to represent these phenomena.

Using the same techniques as those of the preceding sections, we will
now develop a reliability model based on the exponentially decreasing
error-removal model. The number of remaining errors is given in Eq. (5.25f):

$$E_r(\tau) = E_T e^{-\alpha\tau} \tag{5.39a}$$

the failure rate is

$$z(\tau) = kE_T e^{-\alpha\tau} \tag{5.39b}$$

and substitution into Eq. (5.13d) yields the reliability function.

$$R(t) = e^{-\int_0^t kE_T e^{-\alpha\tau}\,dx} = e^{-(kE_T e^{-\alpha\tau})t} \tag{5.40}$$

The preceding equation seems a little peculiar since it is an exponential
function raised to a power that in turn is an exponential function. However, it
is really not that complicated, and this is where the mathematical assumptions
that seem to be reasonable lead. To better understand the result, we will
continue with the running example that was introduced previously.

To make our comparison between models, we have chosen constants that
cause the error-removal function to begin with 130 errors at $\tau = 0$ and
decrease to 10 errors at $\tau = 8$ months. Thus Eq. (5.39a) becomes

$$E_r(\tau = 8) = 10 = 130e^{-8\alpha} \tag{5.41a}$$

Solving this equation for $\alpha$ yields $\alpha = 0.3206$. If we require the
reliability function to yield a reliability of 0.673 at 300 hours of operation
after $\tau = 8$ months of debugging, substitution into Eq. (5.40) yields an
equation allowing us to solve for $k$.

$$R(300) = 0.673 = e^{-k(130e^{-0.3206 \times 8})300} \tag{5.41b}$$

The value of $k = 0.000132$ is the same as that determined previously for the
other models. Thus Eq. (5.40) becomes

$$R(t) = e^{-(0.01716e^{-0.3206\tau})t} \tag{5.42a}$$

The reliability function for $\tau = 8$ months is

$$R(t) = e^{-(0.00132)t} \quad (\tau = 8) \tag{5.42b}$$

Similarly, for $\tau = 6$ and 8.5 months, substitution into Eq. (5.42a) yields
the reliability functions:

$$R(t) = e^{-(0.002507)t} \quad (\tau = 6) \tag{5.42c}$$

$$R(t) = e^{-(0.001125)t} \quad (\tau = 8.5) \tag{5.42d}$$
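The two calibration steps above (solving Eq. (5.41a) for $\alpha$ and Eq. (5.41b) for $k$) can be sketched in code. This is an illustrative sketch under the stated example values only; the names `alpha`, `k`, `failure_rate`, and `reliability` are hypothetical helpers, not part of the original text.

```python
import math

E_T = 130.0        # errors present at tau = 0
E_R_AT_8 = 10.0    # errors required to remain at tau = 8 months
R_TARGET = 0.673   # required reliability after 300 hours of operation
T_OP = 300.0       # operating time in hours

# Eq. (5.41a): 10 = 130 exp(-8 alpha)  ->  alpha = ln(130 / 10) / 8
alpha = math.log(E_T / E_R_AT_8) / 8.0

# Eq. (5.41b): 0.673 = exp(-k * 10 * 300)  ->  k = -ln(0.673) / (10 * 300)
k = -math.log(R_TARGET) / (E_R_AT_8 * T_OP)


def failure_rate(tau_months: float) -> float:
    """z(tau) = k E_T exp(-alpha tau), the coefficient of t in Eq. (5.40)."""
    return k * E_T * math.exp(-alpha * tau_months)


def reliability(t_hours: float, tau_months: float) -> float:
    """R(t) = exp(-z(tau) t), Eq. (5.40)."""
    return math.exp(-failure_rate(tau_months) * t_hours)


if __name__ == "__main__":
    print(f"alpha = {alpha:.4f}, k = {k:.6f}")
    for tau in (6.0, 8.0, 8.5):
        print(f"tau = {tau:3.1f}  z = {failure_rate(tau):.6f}  R(300) = {reliability(T_OP, tau):.2f}")
```

Because $k$ is solved from the target reliability rather than hard-coded, the computed coefficients may differ from Eqs. (5.42b-d) in the last printed digit.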

Equations (5.42b-d) are plotted in Fig. 5.16. The reliability function for
8 months of debugging is, of course, identical to the previous two models
because of the way we have chosen the parameters. The reliability function
for $\tau = 6$ months of debugging yields a reliability of 0.47 at 300 hours of
operation, which is considerably better than the 0.21 reliability in the constant
error-removal-rate model. This occurs because the exponentially decreasing
error-removal model eliminates more errors early and fewer errors later than
the constant error-removal model; thus the loss of debugging between
$6 < \tau < 8$ months is less damaging. This is the same reason why for
$\tau = 8.5$ months of debugging the constant error-removal-rate model does
better [$R(t = 300) = 0.91$] than the exponential model [$R(t = 300) = 0.71$].
If we compare the exponential model with the linearly decreasing one, we find
identical results at $\tau = 8$ months and very similar results at $\tau = 6$
months, where the linearly decreasing model yields [$R(t = 300) = 0.50$] and
the exponential model yields [$R(t = 300) = 0.47$]. This is reasonable since
the initial portion of an exponential function is approximately linear. As was
discussed previously, the linearly decreasing model is assumed to make no
debugging progress after $\tau = 8$ months; thus no comparisons at $\tau = 8.5$
months are relevant.

Figure 5.16 Reliability functions for exponentially decreasing error-removal rate and
6, 8, and 8.5 months of debugging, plotted against time since start of
operation, $t$, in hours. See Eqs. (5.42b-d).

The MTTF function for the exponentially decreasing model is computed by
substituting Eq. (5.40) into Eq. (5.15) or more simply by observing that it is
the reciprocal of the coefficient of $t$ in the exponent of Eq. (5.40):

$$\text{MTTF} = \frac{1}{kE_T e^{-\alpha\tau}} \tag{5.43a}$$

Substitution of the parameters $k = 0.000132$, $E_T = 130$, and
$\alpha = 0.3206$ into Eq. (5.43a) yields

$$\text{MTTF} = \frac{58.28}{e^{-0.3206\tau}} = 58.28e^{0.3206\tau} \tag{5.43b}$$

The MTTF curve given in Eq. (5.43b) is compared with those of Figs. 5.13
and 5.15 in Fig. 5.17. Note that it is easier to compare the behavior of the three
models introduced so far by inspecting the MTTF functions than by comparing
the reliability functions. For the purpose of comparison, we have constrained
all the reliability functions to have the same reliability at $t = 300$ hours
(0.67) after $\tau = 8$ months of debugging; of course, all the reliability
curves start at unity at $t = 0$. Thus, the only comparison we can make is how
fast the reliability curves decay between $t = 0$ and $t = 300$ hours.
Comparison of the MTTF curves yields a bit more information since the curves
are plotted versus $\tau$, which is the resource variable. All three curves in
Fig. 5.17 start at 58 hours and increase to 758 hours after 8 months of testing
and debugging; however, the difference in the concave upward curvature between
$\tau = 2$ and 8 months is quite apparent. The linearly decreasing and
exponentially decreasing curves are about the same: at $\tau = 6$ months, the
linear curve achieves an MTTF of 433 hours and the exponential curve 399 hours,
whereas the constant model reaches only 189 hours. Thus, if we had data for the
first 2 months of debugging and wished to predict the progress as we approached
the release time $\tau = 8$ months, any of the three models would yield
approximately the same results. In applying the models, one would plot the
actual error-removal rate and choose a model that best matches the actual data
(experience would lead us to guess that this would be the exponential model).
The real differences among the models are obvious in the region between
$\tau = 8$ and 10 months. The constant error-removal model climbs to $\infty$
when the debugging time approaches 8.67 months, which is anomalous. The
linearly decreasing model ceases to make progress after 8 months, which is
again counterintuitive. Only the exponentially decreasing model continues to
display progress after 8 months at a reasonable rate. Clearly, other more
advanced reliability models can be (and have been) developed. However, the
purpose of this development is to introduce simple models that can easily be
applied and interpreted, and a worthwhile working model appears to be the
exponentially decreasing error-removal-rate model. The next section deals with
the very important issue of how we estimate the constants of the model.

Figure 5.17 MTTF function for constant, linearly decreasing, and exponentially
decreasing error-removal-rate models.
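The comparison of the three MTTF curves, including the region beyond 8 months, can be tabulated with a short script. This is a sketch under the example parameters used in this section; the helper names are hypothetical, and reporting an infinite MTTF once the constant model predicts no remaining errors is an assumption made here purely to mark where that model breaks down.

```python
import math

K = 0.000132        # failure-rate constant k used in the running example
E_T = 130.0         # total errors at the start of integration testing
RHO_0 = 15.0        # constant model: errors removed per month
K_LINEAR = 30.0     # linear model: initial removal rate K, errors per month
TAU_0 = 8.0         # linear model: month at which the removal rate reaches zero
ALPHA = 0.3206      # exponential model: decay constant per month


def remaining_constant(tau):
    return E_T - RHO_0 * tau


def remaining_linear(tau):
    tau = min(tau, TAU_0)  # per the text, no further removal after tau_0
    return E_T - K_LINEAR * tau * (1.0 - tau / (2.0 * TAU_0))


def remaining_exponential(tau):
    return E_T * math.exp(-ALPHA * tau)


def mttf(remaining_errors):
    """MTTF = 1 / (k * E_r); reported as infinite once no errors remain."""
    if remaining_errors <= 0.0:
        return math.inf
    return 1.0 / (K * remaining_errors)


if __name__ == "__main__":
    models = (remaining_constant, remaining_linear, remaining_exponential)
    print(" tau    constant      linear  exponential")
    for tau in (0, 2, 4, 6, 8, 8.5, 9, 10):
        row = [mttf(f(tau)) for f in models]
        print(f"{tau:4}  " + "  ".join(f"{v:10.2f}" for v in row))
```

Only the exponential column keeps rising smoothly past $\tau = 8$; the linear column freezes at about 758 hours and the constant column becomes unbounded.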


5.7 ESTIMATING THE MODEL CONSTANTS 

5.7.1 Introduction 

The previous sections assumed values for the various model constants; for
example, $k$, $E_T$, and $\alpha$ in Eq. (5.40). In this section, we discuss
the way to estimate values for these constants based on current project data
(measurements) or past data. One can view this parameter estimation procedure
as curve fitting to experimental data or as statistical parameter estimation.
Essentially, this is the same idea from a slightly different viewpoint and
using different methods; however, the end result is the same: to determine
parameters of the model based on early measurements of the project (or past
data) that allow prediction of the future of the project. Before we begin our
discussion of parameter estimation, it is useful to consider other phases of
the project.

In the previous section, we focused on the integration test phase. Software 
reliability models, however, can be applied to other phases of the project. Reli¬ 
ability predictions are most useful when they are made in the very early stages 
of the project, but during these phases so little detailed information is known 
that any predictions have a wide range of uncertainty (nevertheless, they are 
still useful guides). Toward the end of the project, during early field deployment,
a rash of software crashes indicates that more expensive (at this late date)
debugging must be done. The use of a software reliability model can predict 
quantitatively how much more work must be done. If conditions are going 
well during deployment, the model can quantify how well, which is especially 
important if the contract contains a cost incentive. The same models already 
discussed can be used during the deployment phase. To apply software reli¬ 
ability to the earlier module (unit) test phase, another type of reliability model 
must be employed (this is discussed in Section 5.8 on other models). Perhaps 
the most challenging and potentially most useful phase for software reliability 
modeling is during the contracting and early design phases. Because no code 
has been written and none can be tested, any estimates that can be made depend 
on past project data. In fact, we will treat reliability model constant estimation 
based on past data as a general technique and call it handbook estimation. 

5.7.2 Handbook Estimation 

The simplest use of past data in reliability estimation may be illustrated as 
follows. Suppose your company specializes in writing payroll programs for
large organizations, and in the last 10 years you have written 78 systems of
various sizes and complexities. In the last 5 years, reliability data has been 
kept and analyzed for 27 different systems. The data has been compiled along 
with explanations and analyses in a report that is called the company’s Reli¬ 
ability Handbook. The most significant events recorded in this handbook are 
system crashes that occur between one and four times per year for the 27 dif¬ 
ferent projects. In addition, data is recorded on minor errors that occur more 
frequently. A new client, company X, wants to have its antiquated, inadequate
payroll program updated, and this new project is being called system $\beta$.
Company X wants a quote for the development of system $\beta$, and the
reliability of the system is to be included in the quote along with performance
details, the development schedule, the price, and so on. A study of the
handbook reveals that the less complex systems have an MTTF of one-half to one
year. System $\beta$ looks like a project of simple to medium complexity. It
seems that the company could safely say
that the MTTF for the system should be about one-half year but might vary 
from one-quarter to one year. This is a very comfortable situation, but sup¬ 
pose that the only recorded reliability data is on two systems. One data set 
represents in-house data; the other is a copy of a reliability report written by 
a conscientious customer during the first two years of operation who shared 
the report with you. Such data is better than nothing, but it is too weak to 
draw very detailed conclusions. The best action to take is to search for other
data sources for system $\beta$ and make it a company decision to improve your
future position by beginning the collection of data on all new projects as well 
as those currently under development, and query past customers to see if they 
have any data to be shared. You could even propose that the “business data 
processing professional organization” to which you belong sponsors a reliabil¬ 
ity data collection process to be run by an industry committee. This committee 
could start the process by collecting papers reporting on relevant systems that 
have appeared in the literature. An anonymous questionnaire could be circu¬ 
lated to various knowledgeable people, encouraging them to contribute data 
with sufficient technical details to make listing these projects in a composite 
handbook useful, but not enough information so that the company or project 
can be identified. Clearly, the largest software development companies have 
such handbooks and the smaller companies do not. The subject of hardware 
reliability started in the late 1940s with the collection of component and some 
system reliability data spearheaded by Department of Defense funds.
Unfortunately, no similar efforts have been sponsored to date in the software
reliability field by Department of Defense funds or professional organizations.
For a modest initial collection of such data, see Shooman [1983, p. 368,
Table 5.10] and Musa [1987, p. 116, Table 5.2].

From the data that does exist, we are able to compute a rough estimate for
the parameter $E_T$ first introduced in Eq. (5.20) and present in all the models
developed to this point. It seems unreasonable to report the same value for
$E_T$ for both large and small programs; thus Shooman and Musa both normalize
the value by dividing by the total number of source instructions $I_T$. For the
data from Shooman, we exclude the values for the end-of-integration testing,
acceptance testing, and simulation testing. This results in a mean value for
$E_T/I_T$ of $5.14 \times 10^{-3}$ and a standard deviation of
$4.23 \times 10^{-3}$ for seven data points. Similarly, we make the same
computation for the data in Table 5.2 of Musa [1987] for the 25 system test
values and obtain a mean value for $E_T/I_T$ of $7.85 \times 10^{-3}$ and a
standard deviation of $5.27 \times 10^{-3}$. These values are in rough
agreement, considering the diverse data sources and the imperfection in
defining what constitutes not only an error but the phases of development as
well. Thus we can state that based on these two data sets we would expect a
mean value of about 5 to $9 \times 10^{-3}$ for $E_T/I_T$ and a range from
$\mu - \sigma$ (lowest for Shooman data) of about $1 \times 10^{-3}$ to
$\mu + \sigma$ (highest for Musa data) of about $13 \times 10^{-3}$. Of course,
to obtain the value of $E_T$ for any of the models, we would multiply these
values by the value of $I_T$ for the project in question.
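As a worked illustration of the last step, the quoted densities can be scaled by a program size. This is a sketch only: the 20,000-instruction program is a hypothetical example, and the function name `total_error_estimates` is an assumption introduced for illustration.

```python
# Normalized error densities E_T / I_T quoted in the text (errors per source instruction).
DENSITY_LOW = 1e-3                  # about mu - sigma for the Shooman data
DENSITY_MEAN_RANGE = (5e-3, 9e-3)   # approximate range of the two mean values
DENSITY_HIGH = 13e-3                # about mu + sigma for the Musa data


def total_error_estimates(source_instructions: int) -> dict:
    """Scale the handbook densities by I_T to get rough estimates of E_T."""
    return {
        "low": DENSITY_LOW * source_instructions,
        "mean_low": DENSITY_MEAN_RANGE[0] * source_instructions,
        "mean_high": DENSITY_MEAN_RANGE[1] * source_instructions,
        "high": DENSITY_HIGH * source_instructions,
    }


if __name__ == "__main__":
    # Hypothetical program of 20,000 source instructions.
    for label, value in total_error_estimates(20_000).items():
        print(f"{label:>9}: about {value:.0f} errors")
```

For the hypothetical 20,000-instruction program, the estimate of $E_T$ would be roughly 100 to 180 errors, with a plausible spread from about 20 to about 260.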

What about handbook data for the initial estimation of any of the other 
model parameters? Unfortunately, little such data exists in collected form. For 
typical values, see Shooman [1983, p. 368, Table 5.10] and Musa [1987]. 


5.7.3 Moment Estimates 

The best way to proceed with parameter estimation for a reliability model is to
plot the error-removal rate versus $\tau$ on a simple graph with whatever
intervals are used in recording the data (generally, daily or weekly). One
could employ various statistical means to test which model best fits the data:
a constant, a linear, an exponential, or another model, but inspection of the
graph is generally sufficient to make such a determination.


Constant Error-Removal-Rate Data. Suppose that the error-removal data
looks approximately constant and that the time axis is divided into regular
or irregular intervals, $\Delta\tau_i$, corresponding to the data, and that in
each interval there are $E_c(\Delta\tau_i)$ corrected errors. Thus the data for
the error-correction rate is a sequence of values
$E_c(\Delta\tau_i)/\Delta\tau_i$. The simplest way to estimate the value
of $\rho_0$ is to take the mean value of the error-correction rates over the
$i$ recorded intervals:

$$\rho_0 = \frac{1}{i}\sum_{i} \frac{E_c(\Delta\tau_i)}{\Delta\tau_i} \tag{5.44}$$

Thus, by examining Eqs. (5.30a, b), we see that there are two additional
parameters to estimate: $k$ and $E_T$.
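The averaging in Eq. (5.44) is simple enough to sketch directly. The data values below are hypothetical, chosen only to illustrate irregular intervals, and `estimate_rho0` is an illustrative name rather than anything defined in the text.

```python
def estimate_rho0(corrected_errors, interval_lengths):
    """Mean of the per-interval error-correction rates, as in Eq. (5.44)."""
    if len(corrected_errors) != len(interval_lengths) or not corrected_errors:
        raise ValueError("need one interval length per error count")
    rates = [e / dt for e, dt in zip(corrected_errors, interval_lengths)]
    return sum(rates) / len(rates)


if __name__ == "__main__":
    # Hypothetical data: errors corrected in three intervals of 1, 1, and 2 months.
    errors = [14, 16, 30]
    months = [1.0, 1.0, 2.0]
    print(f"rho_0 is about {estimate_rho0(errors, months):.1f} errors per month")
```

Note that this averages the interval rates with equal weight, which is what Eq. (5.44) states; weighting by interval length would be a different estimator.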

The estimate given in Eq. (5.44) utilizes the mean value that is the first
moment and belongs to a general class of statistical estimates called moment
estimates. The general idea of applying moment estimation to the evaluation of
parameters for probability distributions (models) is to first compute a number
of moments of the probability distribution equal to the number of parameters
to be estimated. The moments are then computed from the numerical data; the
first moment formula is equated to the first moment of the data, the second
moment formula is equated to the second moment of the data, and so on until
enough equations are formulated to solve for the parameters. Since we wish
to estimate $k$ and $E_T$ in Eqs. (5.30a, b), two moment equations are needed.
Rather than compute the first and second moments, we use a slight variation
in the method and compute the first moment at two different values of
$\tau_i$: $\tau_1$ and $\tau_2$. Since the random variable is time to failure,
the first moment (mean) is given by Eq. (5.30b). To compute the mean of the
data, we require a set of test data from which we can calculate mean time to
failure. The best data would of course be operational data, but since the
software is being integrated, it would be difficult to place it into operation.
The next best data is simulated operational data, generally obtained by testing
the software in a simulated operational mode by using specially prepared
software. Such software is generally written for use at the end of the test
cycle when comprehensive system tests are performed. It is best that such
software be developed early in the test cycle so that it is available for
"reliability testing" during integration. Although such simulation testing is
time-consuming, it can be employed during off hours (e.g., second and third
shift) so that it does not interrupt the normal development schedule.
(Musa [1987] has written extensively on the use of ordinary integration test
results when simulation testing is not available. This subject will be discussed
later.) Simulation testing is based on a number of scenarios representing
different types of operation and results in $n$ total runs, with $r$ failures
and $n - r$ successes. The $n - r$ successful runs represent
$T_1, T_2, \ldots, T_{n-r}$ hours of successful operation and the $r$
unsuccessful runs represent $t_1, t_2, \ldots, t_r$ hours of successful
operation before the failures occur. Thus the testing produces $H$ total
hours of successful operation.

$$H = \sum_{i=1}^{n-r} T_i + \sum_{i=1}^{r} t_i \tag{5.45}$$

Assuming that the failure rate is constant over the test interval (no debugging
occurs while we are testing), the failure rate is given by $z = \lambda$:

$$\lambda = \frac{r}{H} \tag{5.46a}$$

and since the MTTF is the reciprocal,

$$\text{MTTF} = \frac{1}{\lambda} = \frac{H}{r} \tag{5.46b}$$
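Equations (5.45) and (5.46a, b) translate directly into a small calculation. The sketch below uses hypothetical run durations, and `mttf_from_runs` is an illustrative name, not one taken from the text.

```python
def mttf_from_runs(success_hours, failure_hours):
    """Return (H, lambda, MTTF) from simulation-test runs.

    success_hours: durations T_i of the n - r runs that ended without failure.
    failure_hours: durations t_i of the r runs, measured up to each failure.
    """
    r = len(failure_hours)
    if r == 0:
        raise ValueError("at least one failure is needed to estimate the failure rate")
    total_hours = sum(success_hours) + sum(failure_hours)  # H, Eq. (5.45)
    failure_rate = r / total_hours                         # lambda, Eq. (5.46a)
    return total_hours, failure_rate, total_hours / r      # MTTF, Eq. (5.46b)


if __name__ == "__main__":
    # Hypothetical test campaign: three clean 100-hour runs and two failed runs.
    H, lam, mttf = mttf_from_runs([100.0, 100.0, 100.0], [40.0, 60.0])
    print(f"H = {H:.0f} h, lambda = {lam:.4f} per h, MTTF = {mttf:.0f} h")
```

With these hypothetical runs, 400 total hours and 2 failures give an MTTF estimate of 200 hours.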

Thus, applying the moment method reduces to matching Eqs. (5.30b) and
(5.46b) at times $\tau_a$ and $\tau_b$ in the development cycle, yielding

$$\text{MTTF}_a = \frac{H_a}{r_a} = \frac{1}{k(E_T - \rho_0\tau_a)} \tag{5.47a}$$

$$\text{MTTF}_b = \frac{H_b}{r_b} = \frac{1}{k(E_T - \rho_0\tau_b)} \tag{5.47b}$$


Because $\rho_0$ is already known, the two preceding equations can be solved for
the parameters $k$ and $E_T$, and our model is complete. [One could have skipped
the evaluation of $\rho_0$ using Eq. (5.44) and generated a third MTTF equation
similar to Eqs. (5.47a, b) at a third development time $\tau_c$. The three
equations could then have been solved for the three parameters. The author
feels that fitting as many parameters as possible from the error-removal data
followed by using the test data to estimate the remaining parameters is a
superior procedure.] If we apply this model as integration continues, a
sequence of test data will be accumulated and the question arises: Which two
sets of test data will be used in Eqs. (5.47a, b): the last two or the first
and the last? This issue is settled if we use least-squares or
maximum-likelihood methods of estimation (which will soon be discussed) since
they both use all available sets of test data. In any event, the use of the
moment estimates described in this section is always a good starting point in
building a model, even if more advanced methods will be used later. The reader
must realize that the significant costs and waiting periods for applying such
models are associated with the test results. The analysis takes at most
one-half of a day, and if calculation programs are used, even less time than
that. Thus it is suggested that several models be calculated and compared
as the project progresses whenever new test data is available.
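Solving Eqs. (5.47a, b) for $k$ and $E_T$ takes only a line of algebra: inverting both equations and subtracting gives $1/\text{MTTF}_a - 1/\text{MTTF}_b = k\rho_0(\tau_b - \tau_a)$. The sketch below applies that result; the function name and the sample inputs (taken from the running example rather than from measured data) are assumptions for illustration.

```python
def solve_k_and_total_errors(mttf_a, tau_a, mttf_b, tau_b, rho_0):
    """Solve Eqs. (5.47a, b) for k and E_T given two MTTF measurements."""
    if tau_a == tau_b:
        raise ValueError("the two measurements must be taken at different times")
    lam_a, lam_b = 1.0 / mttf_a, 1.0 / mttf_b
    # Subtracting the inverted equations eliminates E_T.
    k = (lam_a - lam_b) / (rho_0 * (tau_b - tau_a))
    # Back-substitute into the first equation: lam_a = k (E_T - rho_0 tau_a).
    e_t = lam_a / k + rho_0 * tau_a
    return k, e_t


if __name__ == "__main__":
    # MTTF values the running example would predict at 2 and 6 months.
    k, e_t = solve_k_and_total_errors(75.76, 2.0, 189.39, 6.0, rho_0=15.0)
    print(f"k is about {k:.6f}, E_T is about {e_t:.1f}")
```

Fed with the example's own MTTF values, the sketch recovers approximately $k = 0.000132$ and $E_T = 130$, which is a useful sanity check before applying it to real test data.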


Linearly Decreasing Error-Removal-Rate Data. Suppose that inspection of
the error-removal data reveals that the error-removal rate decreases in an
approximately linear manner. Examination of Eq. (5.23b) shows that there are
two parameters in the error-removal-rate model: $K$ and $\tau_0$. In addition,
there is the parameter $E_T$ and, from Eq. (5.27), the additional parameter
$k$. We have several choices regarding the evaluation of these four constants.
One can use the error-removal-rate curve to evaluate two of these parameters,
$K$ and $\tau_0$, and use the test data to evaluate $k$ and $E_T$ as was done
in the previous section in Eqs. (5.47a, b).

The simplest procedure is to evaluate $K$ and $\tau_0$ using the error-removal
rates during the first two test intervals. The error-removal rate is found by
differentiating [cf. Eqs. (5.23d) and (5.24a)].

$$-\frac{dE_r(\tau)}{d\tau} = K\left(1 - \frac{\tau}{\tau_0}\right) \tag{5.48a}$$

If we adopt the same notation as used in Eq. (5.44), the error-removal rate
becomes $E_c(\Delta\tau_i)/\Delta\tau_i$. If we match Eq. (5.48a) at the
midpoints of the first two intervals, $\tau_a/2$ and $\tau_a + \tau_b/2$, the
following two equations result:

$$\frac{E_c(\Delta\tau_a)}{\Delta\tau_a} = K\left(1 - \frac{\tau_a}{2\tau_0}\right) \tag{5.48b}$$

$$\frac{E_c(\Delta\tau_b)}{\Delta\tau_b} = K\left(1 - \frac{\tau_a + \tau_b/2}{\tau_0}\right) \tag{5.48c}$$


and they can be solved for K and t 0 . This leaves the two parameters k and E r , 
which can be evaluated from test data in much the same way as Eqs. (5.47a, 
b). The two equations are 

$$\mathrm{MTTF}_a = \frac{H_a}{r_a} = \frac{1}{k\left[E_T - K\tau_a\left(1 - \dfrac{\tau_a}{2\tau_0}\right)\right]} \tag{5.49a}$$

$$\mathrm{MTTF}_b = \frac{H_b}{r_b} = \frac{1}{k\left[E_T - K\tau_b\left(1 - \dfrac{\tau_b}{2\tau_0}\right)\right]} \tag{5.49b}$$


Exponentially Decreasing Error-Removal-Rate Data. Suppose that inspection
of the error-removal data reveals that the error-removal rate decreases in
an approximately exponential manner. One good way of testing this assumption
is to plot the error-removal-rate data on a semilog graph by computer or on
graph paper. An exponential curve rectifies on semilog axes. (There are more
sophisticated statistical tests to check how well a set of data fits an
exponential curve. See Shooman [1983, p. 28, problem 1.3] or Hoel [1971].) If
Eq. (5.40) is examined, we see that there are three parameters to estimate:
$k$, $E_T$, and $\alpha$. As before, we can estimate some of these parameters
from the error-removal-rate data and some from simulation test data. One could
probably use theoretical arguments to determine which parameters should be
estimated from one set of data and which from the other; however, the practical
approach is to use the better data to estimate as many parameters as possible.
Error-removal data is universally collected whenever the software comes under
configuration control, but simulation test data requires more effort and expense.
Error-removal data is therefore more plentiful, allowing the estimation of as
many model parameters as possible. Examination of Eq. (5.25e) reveals that
$E_T$ and $\alpha$ can be estimated from the error data. Estimation equations
for $E_T$ and $\alpha$ begin with Eq. (5.25e). Taking the natural logarithm of
both sides of the equation yields

$$\ln\{E_r(\tau)\} = \ln\{E_T\} - \alpha\tau \tag{5.50a}$$

If we have two sets of error-removal data at $\tau_a$ and $\tau_b$, Eq. (5.50a)
can be used to solve for the two parameters. Substitution yields

$$\ln\{E_r(\tau_a)\} = \ln\{E_T\} - \alpha\tau_a \tag{5.50b}$$

$$\ln\{E_r(\tau_b)\} = \ln\{E_T\} - \alpha\tau_b \tag{5.50c}$$

Subtracting the second equation from the first and solving for $\alpha$ yields

$$\alpha = \frac{\ln\{E_r(\tau_a)\} - \ln\{E_r(\tau_b)\}}{\tau_b - \tau_a} \tag{5.51}$$


Knowing the value of $\alpha$, one could substitute into either Eq. (5.50b) or
(5.50c) to solve for $E_T$. However, there is a simple way to use information
from both equations (which should give a better estimate): add the two
equations and solve for $E_T$.

$$\ln\{E_T\} = \frac{\ln\{E_r(\tau_a)\} + \ln\{E_r(\tau_b)\} + \alpha(\tau_a + \tau_b)}{2} \tag{5.52}$$


Once we know $E_T$ and $\alpha$, one set of integration test data can be used to
determine $k$. From Eq. (5.43a), we proceed in the same manner as Eq. (5.47a);
however, only one test time is needed.

$$\mathrm{MTTF}_a = \frac{H_a}{r_a} = \frac{1}{kE_T e^{-\alpha\tau_a}} \tag{5.53}$$
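
As a minimal sketch of the moment estimates in Eqs. (5.51) to (5.53), the short
Python function below computes $\alpha$, $E_T$, and $k$ from two error data
points and one measured MTTF. The function name, argument names, and the
numerical values are hypothetical and chosen only for illustration; the data
are generated from assumed "true" parameters so that the estimates can be
checked against them.

```python
import math

def moment_estimates(tau_a, er_a, tau_b, er_b, mttf_a):
    """Two-point moment estimates for the exponential model.

    er_a, er_b are the values of E_r at tau_a and tau_b; mttf_a is the
    measured H_a / r_a from one simulation test at tau_a.
    """
    # Eq. (5.51): slope between the two logarithmic data points
    alpha = (math.log(er_a) - math.log(er_b)) / (tau_b - tau_a)
    # Eq. (5.52): uses both data points to estimate ln(E_T)
    ln_e_t = (math.log(er_a) + math.log(er_b) + alpha * (tau_a + tau_b)) / 2.0
    e_t = math.exp(ln_e_t)
    # Eq. (5.53) solved for k
    k = 1.0 / (mttf_a * e_t * math.exp(-alpha * tau_a))
    return alpha, e_t, k

# Hypothetical data generated from assumed parameters (illustration only)
true_alpha, true_e_t, true_k = 0.5, 100.0, 0.02
tau_a, tau_b = 1.0, 2.0
er_a = true_e_t * math.exp(-true_alpha * tau_a)
er_b = true_e_t * math.exp(-true_alpha * tau_b)
mttf_a = 1.0 / (true_k * er_a)

alpha, e_t, k = moment_estimates(tau_a, er_a, tau_b, er_b, mttf_a)
print(alpha, e_t, k)
```

With noise-free data the estimates reproduce the assumed parameters; with real
data the two-point estimate is only a starting value, as noted in the text.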


5.7.4 Least-Squares Estimates 

The moment estimates of the preceding sections have a number of good 
attributes: 

1. They require the least amount of data. 

2. They are computationally simple. 

3. They serve as a good starting point for more complex estimates. 

The computational simplicity is not too significant in this era of cheap, fast 
computers. Nevertheless, it is still a good idea to use a calculator, pencil, and 
paper to get a feeling for data values before a more complex, less transparent, 
more accurate computer algorithm is used. 

The main drawback of moment estimates is the lack of clear direction for
how to proceed when several data sets are available. The simplest procedure
in such a case is to use least-squares estimation. A complete development of
least-squares estimation appears in Shooman [1990] and is applied to software
reliability modeling in Shooman [1983, pp. 372-374]. However, computer
mathematics packages such as Mathematica, Mathcad, Macsyma, and Maple all
have least-squares programs that are simple to use; any increased complexity
is buried within the program, and computational time is not significant with
modern computers. We will briefly discuss the use of least-squares estimation
for the case of an exponentially decreasing error-removal rate.

Examination of Eq. (5.50a) shows that on semilog paper, the equation
becomes a straight line. It is recommended that the data be initially plotted
and a straight line be fitted by inspection through the data. When $\tau = 0$,
the y-axis intercept, $E_r(\tau = 0)$, is equal to $E_T$, and the slope of the
straight line is $-\alpha$. Once these initial estimates have been determined,
one can use a least-squares program to find the mean values of the parameters
and their variances.

In a similar manner, one can determine the value of $k$ by substitution in Eq.
(5.53) for one set of simulation data. Assuming that we have several sets of
simulation data at $\tau_j = \tau_a, \tau_b, \ldots$, we can write the equation as

$$\ln\{\mathrm{MTTF}_j\} = \ln\left\{\frac{H_j}{r_j}\right\} = -[\ln\{k\} + \ln\{E_T\} - \alpha\tau_j] \tag{5.54}$$

The preceding equation is used as the basis of a least-squares estimation
to determine the mean value and variance of $k$. Again, it is useful to plot Eq.
(5.54) and fit a straight line to the data as a precursor to program estimation.
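
The least-squares procedure can be sketched in a few lines of Python. This is
an illustration under stated assumptions, not the algorithm of any particular
package: `fit_line` and `least_squares_estimates` are hypothetical names, the
data are synthetic, and only the mean values are computed (variance estimates
are omitted). The first fit applies Eq. (5.50a); the second uses Eq. (5.54)
rearranged for $\ln\{k\}$, where the least-squares fit of a constant reduces
to the sample mean.

```python
import math

def fit_line(xs, ys):
    """Ordinary least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def least_squares_estimates(taus, errors, test_taus, mttfs):
    # Fit ln E_r(tau) = ln E_T - alpha * tau, Eq. (5.50a)
    intercept, slope = fit_line(taus, [math.log(e) for e in errors])
    alpha, e_t = -slope, math.exp(intercept)
    # Eq. (5.54) rearranged: ln k = -ln MTTF_j - ln E_T + alpha * tau_j
    ln_ks = [-math.log(m) - math.log(e_t) + alpha * t
             for t, m in zip(test_taus, mttfs)]
    k = math.exp(sum(ln_ks) / len(ln_ks))
    return alpha, e_t, k

# Synthetic, noise-free data from assumed parameters (illustration only)
taus = [1.0, 2.0, 3.0, 4.0]
errors = [100.0 * math.exp(-0.5 * t) for t in taus]
test_taus = [1.0, 3.0]
mttfs = [1.0 / (0.02 * 100.0 * math.exp(-0.5 * t)) for t in test_taus]

ls_alpha, ls_e_t, ls_k = least_squares_estimates(taus, errors, test_taus, mttfs)
print(ls_alpha, ls_e_t, ls_k)
```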

5.7.5 Maximum-Likelihood Estimates 

In England in the 1930s, Fisher developed the elegant theory called
maximum-likelihood estimation (MLE) for estimating the values of parameters of
probability distributions from data [Shooman, 1983, pp. 537-540; Shooman, 1990,
pp. 80-96]. We can explain some of the ideas underlying MLE in a simple
fashion. If $R(t)$ is the reliability function, then $f(t)$ is the associated
density function for the time to failure, and the parameters are $\theta_1$,
$\theta_2$, and so forth, and we have $f(\theta_1, \theta_2, \ldots, \theta_i, t)$.
The data are the several values of time to failure $t_1, t_2, \ldots, t_i$, and
the task is to estimate the best values for $\theta_1, \theta_2, \ldots, \theta_i$
from the data. Suppose there are two parameters, $\theta_1$ and $\theta_2$, and
three values of time data: $t_1 = 50$, $t_2 = 200$, and $t_3 = 490$. If we know
the values of $\theta_1$ and $\theta_2$, then the probability of obtaining the
test values is related to the joint likelihood function (assuming independence),
$L(\theta_1, \theta_2) = f(\theta_1, \theta_2, 50) \cdot f(\theta_1, \theta_2, 200)
\cdot f(\theta_1, \theta_2, 490)$. Fisher’s brilliant procedure was to compute
values of $\theta_1$ and $\theta_2$ that maximize $L$. To find the maximum of
$L$, one computes the partial derivatives of $L$ with respect to $\theta_1$ and
$\theta_2$ and sets these derivatives to zero. The resultant equations are
solved for the MLE values of $\theta_1$ and $\theta_2$. If there are more than
two parameters, more partial derivative equations are needed. The application
of MLE to software reliability models is discussed in
Shooman [1983, pp. 370-372, 544-548].
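
To make the procedure concrete, the sketch below applies MLE to the three
failure times quoted above, but for a simpler case than the two-parameter
example in the text: we assume a one-parameter exponential density,
$f(\lambda, t) = \lambda e^{-\lambda t}$ (a constant failure rate). That
density and the function names are assumptions made for illustration. Because
the logarithm is monotonic, maximizing $\ln L$ gives the same estimate as
maximizing $L$; setting $d\ln L/d\lambda = 0$ yields $\lambda = n/\sum t_i$.

```python
import math

def log_likelihood(lam, times):
    """ln L for the assumed density f(lam, t) = lam * exp(-lam * t)."""
    return sum(math.log(lam) - lam * t for t in times)

def mle_rate(times):
    """Closed-form solution of d(ln L)/d(lam) = 0: lam = n / sum(t_i)."""
    return len(times) / sum(times)

times = [50.0, 200.0, 490.0]
lam_hat = mle_rate(times)
print(lam_hat, 1.0 / lam_hat)  # failure-rate estimate and the implied MTTF

# Numerical check that nearby values of lam give a smaller likelihood
for factor in (0.9, 1.1):
    assert log_likelihood(lam_hat, times) > log_likelihood(factor * lam_hat, times)
```

For densities with several parameters, the corresponding equations generally
have no closed form and must be solved numerically, which is the second
disadvantage listed below.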

The advantages of MLE estimates are as follows: 

1. They automatically handle multiple data sets. 

2. They provide variance estimates.

3. They have some sophisticated statistical evaluation properties.

Note that least-squares estimation also possesses the first two properties. 
Some of the disadvantages of MLE estimates are as follows: 

1. They are more complex and more difficult to understand than moment 
or least-squares estimates. 

2. MLE estimates involve the solution of a set of complex equations that 
often requires numerical solution. (Moment or least-squares estimates 
can be used as starting values to expedite the numerical solution.) 

The way of overcoming the first problem in the preceding list is to start 
with moment or least-squares estimates to develop insight, whereas the second 
problem requires development of a computer estimation program, which takes 
some development effort. Fortunately, however, such programs are available; 
among them are SMERFS [Farr, 1991; Lyu, 1996, pp. 733-735]; SoRel [Lyu,
1996, pp. 737-739]; CASRE [Lyu, 1996, pp. 739-745]; and others [Stark,
Appendix A in Lyu, 1996, pp. 729-745].

5.8 OTHER SOFTWARE RELIABILITY MODELS 

5.8.1 Introduction 

Since the first software reliability models were introduced [Jelinski and
Moranda, 1972; Shooman, 1972], many software reliability models have been
developed. The ones introduced in the preceding section are simple
to understand and apply. In fact, depending on how one counts, the 4 models
(constant, linearly decreasing, exponentially decreasing, and S-shaped) along
with the 3 parameter estimation methods (moment, least-squares, and MLE)
actually form a group of 12 models. Some of the other models developed in
the literature are said to have better “mathematical properties” than these
simple models. However, the real test of a model is how well it performs;
that is, if data is taken between months 1 and 2 of an 8-month project, how
well does it predict at the end of month 2 the growth in MTTF or the decreasing
failure rate between months 3 and 8? Also, how does the prediction improve
after data for months 3 and 4 is added, and so forth?

5.8.2 Recommended Software Reliability Models 

Software reliability models are not used as universally in software development 
as they should be. Some reasons that project managers give for this are the 
following: 

1. It costs too much to do such modeling and I can’t afford it within my
project budget.

2. There are so many software reliability models to use that I don’t know
which is best; therefore, I choose not to use any.

3. We are using the most advanced software development strategies and 
tools and produce high-quality software; thus we don’t need reliability 
measurements. 

4. Even if a model told me that the reliability will be poor, I would just test 
some more and remove more errors. 

5. If I release a product with too many errors, I can always fix those that
get discovered during early field deployment.

Almost all of these responses are invalid. Regarding response (1), it does not
cost that much to employ software reliability models. During integration
testing, error collection is universally done, and the analysis is relatively
inexpensive. The only real cost is the scheduling of the simulation/system test
early in integration testing, and since this can be done during off hours
(second and third shift), it is not that expensive and does not delay
development. (Why do managers always state that there is not enough money to do
the job right, yet always find lots of money to fix residual errors that should
have been eliminated much earlier in the development process?) Response (3) has
been the universal cry of software development managers since the dawn of
software, and we know how often this leads to grief. Responses (4) and (5) are
true and have some merit; however, the cost of fixing a lot of errors at these
late stages is prohibitive, and the delivery schedule and early reputation of a
product are imperiled by such an approach. This leaves us with response (2),
which is true and for which some of the models are mathematically sophisticated.
This is one of the reasons why the preceding section’s treatment of software
reliability models focused on the simplest models and methods of parameter
estimation in the hope that the reader would follow the development and absorb
the principles.

As a direct rebuttal to response (2), a group of experienced reliability
modelers (including this author) began work in the early 1990s to produce
a document called Recommended Practice for Software Reliability (a software
reliability standard) [AIAA/ANSI, 1993]. This standard recommends
four software reliability models: the Schneidewind model, the generalized
exponential model [Shooman, April 1990], the Musa/Okumoto model, and the
Littlewood/Verrall model. A brief study of the models shows that the
generalized exponential model is identical with the three models discussed
previously in this chapter. The basic development described in the previous
section corresponds to the earliest software reliability models [Jelinski and
Moranda, 1972; Shooman, 1972], and the constant error-removal-rate model
[Shooman, 1972]. The linearly decreasing error-removal-rate model is
essentially Musa’s basic model [1975], and the exponentially decreasing
error-removal-rate model is Musa’s logarithmic model [1987]. Comprehensive
parameter estimation equations appear in the AIAA/ANSI standard [1993] and in
Shooman [1990]. The reader is referred to these references for further details.

5.8.3 Use of Development Test Data

Several authors, notably Musa, have observed that it would be easiest to use
development test data where the tests are performed and the system operates
for $T$ hours rather than simulating real operation where the software runs for
$t$ hours of operation. We assume that development tests stress the system more
“rapidly” than simulated testing; that is, $T = Ct$ with $C > 1$. In practice,
Musa found that values of 10-15 are typical for $C$. If we introduce the
parameter $C$ into the exponentially decreasing error-rate model (Musa’s
logarithmic model), we have an additional parameter to estimate. Parameters
$E_T$ and $\alpha$ can be estimated from the error-removal data; $k$ and $C$,
from the development test data. This author feels that the use of simulation
data not requiring the introduction of $C$ is superior; however, the use of
development data and the necessary introduction of the fourth parameter $C$ is
certainly convenient. If such a method is to be used, a handbook with data
listing previous values of $C$ and judicious choices from the previous results
would be necessary for accurate prediction.


5.8.4 Software Reliability Models for Other Development Stages 

The software reliability models introduced so far are immediately applicable
to integration testing or early field deployment stages. (The models also apply
to later field deployment, but by then it is often too late to improve a bad
product; a good product is apparent to everybody and needs little further
debugging.) The earlier one can employ software reliability models, the more
useful they are in predicting the future. However, during unit (module)
testing, other models are required [Shooman, 1983, 1990].

Software reliability estimation is of great use in the specification and early
design phases as a means of estimating how good the product can be made.
Such estimates depend on the availability of field data on other similar past
projects. Previous project data would be tabulated in a “handbook” of previous
projects, and such data can be used to obtain initial values of parameters
for the various models by matching the present project with similar historical
projects. Such handbook data does exist within the databases of large software
development organizations, but this data is considered proprietary and is only
available to workers within the company. The existence of a “software
reliability handbook” in the public domain would require the support of a
professional or government organization to serve as a sponsor.

Assuming that we are working within a company where such data is available
early in the project (perhaps even during the proposal phase), early estimates
can be made based on the use of historical data to estimate the model
parameters. Accuracy of the parameters depends on the closeness of the match
between handbook projects and the current one in question. If a few projects
are acceptable matches, one can estimate the parameter range.

If one is fortunate enough to possess previous data and, later, to obtain
system test data, one is faced with the decision regarding when the previous
project data is to be discarded and when the system test data can be used to
estimate model parameters. The initial impulse is to discard neither data set
but to average them. Indeed, the statistical approach would be to use Bayesian
estimation procedures (see Mood and Graybill [1963, p. 187]), which may be
viewed as an elaborate statistical-weighting scheme. A more direct approach is
to use a linear-weighting scheme. Assume that the historical project data leads
to a reliability estimate for the software given by $R_0(t)$, and the
reliability estimate from system test data is given by $R_1(t)$. The composite
estimate is given by

$$R(t) = a_0 R_0(t) + a_1 R_1(t) \tag{5.55}$$

It is not difficult to establish that $a_0 + a_1$ should be set equal to unity.
Before test data is available, $a_0$ will be equal to unity and $a_1$ will be 0;
as test data becomes available, $a_0$ will approach 0 and $a_1$ will approach
unity. The weighting procedure is derived by minimizing the variance of $R(t)$,
assuming that the variance of $R_0(t)$ is given by $\sigma_0^2$ and that of
$R_1(t)$ by $\sigma_1^2$. The end result is a set of weighting formulas given by
the equations that follow. (For details, see Shooman [1971].)

The reader who has studied electric-circuit theory can remember the form
of these equations by observing that they are analogous to how resistors
combine in parallel. To employ these equations, the analyst must estimate a
value of $\sigma_0^2$ based on the variability of the previous project data and
use the value of $\sigma_1^2$ given by applying the least-squares (or another)
method to the system test data.
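
The weighting formulas themselves are not reproduced in this text. As a hedged
sketch, the Python function below uses the standard minimum-variance
(inverse-variance) weights for two independent estimates,
$a_0 = \sigma_1^2/(\sigma_0^2 + \sigma_1^2)$ and
$a_1 = \sigma_0^2/(\sigma_0^2 + \sigma_1^2)$, which sum to unity and have the
parallel-resistor form mentioned above. That these are the intended formulas is
an assumption; the function name and numerical values are hypothetical.

```python
def combine_estimates(r0, var0, r1, var1):
    """Composite estimate of Eq. (5.55) with assumed inverse-variance weights.

    r0, var0: reliability estimate and variance from historical project data.
    r1, var1: reliability estimate and variance from system test data.
    """
    a0 = var1 / (var0 + var1)   # weight on the historical estimate
    a1 = var0 / (var0 + var1)   # weight on the test-data estimate
    combined_var = var0 * var1 / (var0 + var1)  # parallel-resistor form
    return a0 * r0 + a1 * r1, a0, a1, combined_var

# Hypothetical values: the test-data estimate is the more precise of the two
r, a0, a1, v = combine_estimates(0.90, 0.01, 0.96, 0.0025)
print(r, a0, a1, v)
```

When the test data is sparse, its variance is large, so $a_0$ is close to unity;
as test data accumulates and its variance shrinks, the weight shifts toward
$a_1$, matching the limiting behavior described in the text.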

The problems at the end of this chapter provide further exploration of other
models, the parameter estimation, the numerical differences among the methods,
and the effect on the reliability and MTTF functions. For further details
on software reliability models, the reader is referred to AIAA/ANSI standard
[1993], Musa [1987], and Lyu [1996].

5.8.5 Macro Software Reliability Models

Most of the software reliability models in the literature are black box models.
There is one clear box model that relates the software reliability to some
features of the program structure [Shooman, 1983, pp. 377-384; Shooman, 1991].
This model decomposes the software into major execution paths of the control
structure. The software failure rate is developed in terms of the frequency of
path execution, the probability of error along a path, and the traversal time for
the path. For more details, see Shooman [1983, 1991].


5.9 SOFTWARE REDUNDANCY 
5.9.1 Introduction 

Chapters 3 and 4 discussed in detail the various ways one can employ redundancy
to enhance the reliability of the hardware. After a little thought, we raise the
question: Can we employ software redundancy? The answer is yes; however, there
are several issues that must be explored. A good way to introduce these
considerations is to assume that one has a TMR system composed of three
identical digital computers and a voter. The preceding chapter detailed the
hardware reliability for such a system, but what about the software? If each
computer contains a copy of the same program, then when one computer experiences
a software error, the other two should as well. Thus the three copies of the
software provide no redundancy. The system model would be a hardware TMR system
in series with the software reliability, and the system reliability, $R_{sys}$,
would be given by the product of the hardware voting system, $R_{TMR}$, and the
software reliability, $R_{software}$, assuming independence between the hardware
and software errors. We should actually speak of two types of software errors.
The first type is the most common one due to a scenario with a set of inputs
that uncovers a latent fault in the software. Clearly, all copies of the same
software will have that same fault and should process the scenario identically;
thus there is no software redundancy. However, some software errors are due to
the interaction of the inputs, the state of the hardware, and any residual
faults. By the state of the hardware we mean the storage values in registers
(maybe other storage devices) at the time the scenario is begun. Since these
storage values are dependent on when the computer is powered up and cleared as
well as the past data processed, the states of the three processors may differ.
There may be a small amount of redundancy due to these effects, but we will
ignore state-dependent errors.

Based on the foregoing discussion, the only way one can provide software
reliability is to write different independent versions of the software. The cost
is higher, of course, and there is always the chance that even independent
programming groups will incorporate the same (common mode) software errors,
degrading the amount of redundancy provided. A complete discussion appears
in Shooman [1990, pp. 582-587]. A summary of the relevant analysis appears
in the following paragraphs, as well as an example of how modular hardware
and software redundancy is employed in the Space Shuttle orbital flight control
system.


5.9.2 N-Version Programming

The official term for separately developed but functionally identical versions of
software is N-version software. We provide only a brief summary of these
techniques here; the reader is referred to the following references for details:
Lala [1985, pp. 103-107]; Pradhan [1986, pp. 664-667]; and Siewiorek [1982, pp.
119-121, 169-175]. The term N-version programming was probably coined by
Chen and Avizienis [1978] to liken the use of redundant software to N-modular
redundancy in hardware. To employ this technique, one writes two or more
independent versions of the program and uses them in a voting-type arrangement.
The heart of the matter is to discuss what we mean by independent software.
Suppose we have three processors in a TMR arrangement, all running
the same program. We assume that hardware and software failures are independent
except for natural or manmade disasters that can affect all three computers
(earthquake, fire, power failure, sabotage, etc.). In the case of software error,
we would expect all three processors to err in the same manner and the voter to
dutifully pass on the same erroneous output without detection of an error. (As
was discussed previously, the only possible differences lie in the rare case in
which the processors have different states.) To design independent programs to
achieve software reliability, we need independent development groups (probably
in different companies), different design approaches, and perhaps even
different languages. A simplistic example would be the writing of a program
to find the roots of a quadratic equation, $f(x)$, which has only real roots. The
obvious approach would be to use the quadratic formula. A different design
would be to use the theorem from the theory of equations, which states that if
$f(a) > 0$ and if $f(b) < 0$, then at least one root lies between $a$ and $b$.
One could bisect the interval $(a, b)$, check the sign of $f([a + b]/2)$, and
choose a new, smaller interval. Once iteration determines the first root,
polynomial division can be used to determine the second root. We could ensure
further diversity of the two approaches by coding one in C++ and the other in
Ada. There are some difficulties in ensuring independent versions and in
synchronizing different versions, as well as possible problems in comparing the
outputs of different versions.
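
The quadratic example can be sketched as two deliberately different designs
plus a comparison step. For brevity both versions are written here in Python
rather than in two languages, so the sketch illustrates design diversity only;
all function names, the bracketing strategy (starting from the vertex of the
parabola), and the comparison tolerance are assumptions made for illustration.
Note that the comparison uses a tolerance rather than exact equality, which is
one of the output-comparison difficulties mentioned above.

```python
import math

def roots_formula(a, b, c):
    """Version 1: the quadratic formula (a != 0, real roots assumed)."""
    disc = b * b - 4.0 * a * c
    if disc < 0.0:
        raise ValueError("complex roots are outside the stated problem")
    s = math.sqrt(disc)
    return sorted([(-b - s) / (2.0 * a), (-b + s) / (2.0 * a)])

def roots_bisection(a, b, c, tol=1e-12):
    """Version 2: bracket a sign change, bisect, then deflate."""
    def f(x):
        return (a * x + b) * x + c

    lo = -b / (2.0 * a)              # vertex of the parabola
    if f(lo) == 0.0:                 # double root at the vertex
        return [lo, lo]
    if f(lo) * a > 0.0:
        raise ValueError("complex roots are outside the stated problem")
    step = 1.0
    hi = lo + step
    while f(lo) * f(hi) > 0.0:       # widen until f changes sign
        step *= 2.0
        hi = lo + step
    for _ in range(200):             # bisection on [lo, hi]
        mid = 0.5 * (lo + hi)
        if f(lo) * f(mid) <= 0.0:
            hi = mid
        else:
            lo = mid
        if hi - lo < tol:
            break
    r1 = 0.5 * (lo + hi)
    r2 = -b / a - r1                 # root of the quotient after dividing by (x - r1)
    return sorted([r1, r2])

def versions_agree(a, b, c, tol=1e-6):
    """Compare the outputs of the two versions within a tolerance."""
    v1, v2 = roots_formula(a, b, c), roots_bisection(a, b, c)
    return all(abs(x - y) <= tol for x, y in zip(v1, v2))

print(roots_formula(1.0, -3.0, 2.0), roots_bisection(1.0, -3.0, 2.0))
print(versions_agree(1.0, -3.0, 2.0))
```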

It has been suggested that the following procedures be followed to ensure 
that we develop independent versions: 

1. Each programmer works from the same requirements. 

2. Each programmer or programming group works independently of the
others, and communication between groups is not permitted except by
passing messages (which can be edited or blocked) through the contracting
organization.

3. Each version of the software is subjected to the same comprehensive
acceptance tests.

Dependence among errors in various versions can occur for a variety of 
reasons, such as the following: 

1. Identical misinterpretation of the requirements. 

2. Identical, incorrect treatment of boundary problems. 

3. Identical (or equivalent), incorrect designs for difficult portions of the 
problem. 

The technique of N-version programming has been used or proposed for a
variety of situations, such as the following:

1. For Space Shuttle flight control software (discussed in Section 5.9.3). 

2. For the slat-and-flap control system of A310 Airbus Industry aircraft. 

3. For point switching, signal control, and traffic control in the Goteborg 
area of the Swedish State Railway. 

4. For nuclear reactor control systems (proposed by several authors). 

If the software versions are independent, we can use the same mathematical 
models as were introduced in Chapter 4. Consider the triple-modular redundant 
(TMR) system as an example. If we assume that there are three independent 
versions of the software and that the voting is perfect, then the reliability of 
the TMR system is given by 


$$R = p_i^2(3 - 2p_i) \tag{5.57}$$

where $p_i$ is the identical reliability of each of the three versions of the
software. We assume that all of the software faults are independent and affect
only one of the three versions.

Now, we consider a simple model of dependence. If we assume that there
are two different ways in which common-mode dependencies exist, that is,
requirements and program, then we can make the model given in Fig. 5.18.
The reliability expression for this model is given by Shooman [1988].

$$R = p_{cmr}\,p_{cms}\,[p_i^2(3 - 2p_i)] \tag{5.58}$$

This expression is the same mathematical formula as that of a TMR system
with an imperfect voter (i.e., the common-mode errors play an analogous role
to voter failures).

Figure 5.18 Reliability model of a triple-modular program including common-mode
failures, where

$p_i$ = 1 - Probability of an independent-mode-software fault
$p_{cmr}$ = 1 - Probability of a common-mode-requirements error
$p_{cms}$ = 1 - Probability of a common-mode-software fault

The results of the above analysis will be more meaningful if we evaluate
the effects of common-mode failures for a set of data. Although common-mode
data is hard to obtain, Chen and Avizienis [1978] and Pradhan [1986, p.
665] report some practical data for 12 different sets of 3 independent programs
written for solving a differential equation for temperature over a
two-dimensional region. From these results, we deduce that the individual
program reliabilities were $p_i = 0.851$, and substitution into Eq. (5.58)
yields $R = 0.94$ for the TMR system. Thus the unreliability of the single
program, (1 - 0.851) = 0.149, has been reduced to (1 - 0.94) = 0.06; the
decrease in unreliability (0.149/0.06) is a factor of 2.48 (the details of the
computation are in Shooman [1990, pp. 583-587]). This data did not include any
common-mode failure information; however, the second example to be discussed
does include this information.

Some data gathered by Knight and Leveson [1986] discussed 27 different 
versions of a program, all of which were subjected to 200 acceptance tests. 
Upon acceptance, the program was subjected to one million test runs (see also 
McAllister and Vouk [1996]). 

Five of the programs tested without error, and the number of errors in
the others ranged up to 9,656 for program number 22, which had a demonstrated
$p_i$ = (1 - 9,656/1,000,000) = 0.990344. If there were no common-mode
errors, substitution of this value for $p_i$ into Eq. (5.57) yields
$R = 0.99972$. The improvement in unreliability, $1 - R$, is 0.009656/0.00028,
or a factor of 34.5.

The number of common occurrences was also recorded for each error, allowing
one to estimate the common-mode probability. By treating all the common-mode
situations as if they affected all the programs (a worst-case assumption),
we have as the estimate of common mode (sum of the number of multiple
failure occurrences)/(number of tests) = 1,255/1,000,000 = 0.001255. The
probability of no common-mode error is given by $p_{cmr}\,p_{cms}$ =
1 - 0.001255 = 0.998745. Substitution into Eq. (5.58) yields $R = 0.99846$. The
improvement in $1 - R$ would now be from 0.009656 to 0.00154, and the
improvement factor is 6.27, still substantial, but a significant decrease from
the 34.5 that was achieved without common-mode failures. (The details are given
in Shooman [1990, pp. 582-587].) Another case is computed in which the initial
value of $p_i$ = (1 - 1,368/1,000,000) = 0.998632 is much higher. In this case,
TMR produces a reliability of 0.99999433 for an improvement in unreliability by
a factor of 241. However, the same estimate of common-mode failures reduces
this factor to only 1.1! Clearly, such a small improvement factor would not be
worth the effort, and either the common-mode failures must be reduced or other
methods of improving the software reliability should be pursued. Although this
data varies from program to program, it does show the importance of common-mode
failures. When one wishes to employ redundant software, clearly one
must exercise all possible cautions to minimize common-mode failures. Also,
it is suggested that modeling be done at the outset of the project using the best
estimates of independent and common-mode failure probabilities and that this
continue throughout the project based on the test results.
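
The arithmetic in these examples is easy to repeat for other values. The
following Python sketch evaluates Eqs. (5.57) and (5.58) and the improvement
factor in unreliability; the function names are hypothetical, and the single
argument `p_cm` stands for the product $p_{cmr}\,p_{cms}$.

```python
def tmr_reliability(p_i, p_cm=1.0):
    """Eq. (5.58); with p_cm = 1 it reduces to Eq. (5.57)."""
    return p_cm * p_i ** 2 * (3.0 - 2.0 * p_i)

def improvement_factor(p_i, p_cm=1.0):
    """Ratio of single-version unreliability to TMR unreliability."""
    return (1.0 - p_i) / (1.0 - tmr_reliability(p_i, p_cm))

p_cm = 1.0 - 1255 / 1_000_000          # worst-case common-mode estimate
for p_i in (0.990344, 0.998632):
    print(p_i,
          tmr_reliability(p_i), improvement_factor(p_i),
          tmr_reliability(p_i, p_cm), improvement_factor(p_i, p_cm))
```

Because the text rounds intermediate values before dividing, the factors
computed at full precision differ slightly from the quoted ones, but the
conclusion is unchanged: the common-mode term dominates once $p_i$ is high.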

5.9.3 Space Shuttle Example 

One of the best known examples of hardware and software reliability is the
Space Shuttle Orbiter flight control system. Once in orbit, the flight control
system must maintain the vehicle’s attitude (rotations about 3 axes fixed in
inertial space). Typically, one would use such rotations to lock onto a view of
the earth below, travel along a line of sight to an object that the Space Shuttle
is approaching, and so forth. The Space Shuttle uses a combination of various
large and small gas jets oriented about the 3 axes to produce the necessary
rotations. Orbit-change maneuvers, including the crucial reentry phase, are also
carried out by the flight control system using somewhat larger orbit-maneuvering
system (OMS) engines. There is much hardware redundancy in terms of
sensors, various groupings of the small gas jets, and even the use of a
combination of small gas jets for sustained firing should the OMS engines fail.
In this section, we focus on the computer hardware and software in this system,
which is shown in Fig. 5.19.

There are five identical computers in the system, denoted as Hardware A, 
B, C, D, and E, and two different software systems, denoted by Software A 
and B. Computers A-D are connected in a voting arrangement with lockout 
switches at the inputs to the voter as shown. Each of these computers uses the 
complete software system—Software A. The four computers and associated 
software comprise the primary avionics software system (PASS), which is a 
two-out-of-four system. If a failure in one computer occurs and is confirmed 
by subsequent analysis and by disagreement with the other three computers as 
well as by other tests and telemetered data to Ground Control, this computer 
is then disconnected by the crew from the arrangement, and the remaining 
system becomes a TMR system. Thus this system will sustain two failures 
and still be functional rather than tolerating only a single failure, as is the case 
with an ordinary TMR system. Because of all the monitoring and test programs 
available in space and on the ground, it is likely that even after two failures, if a 
third malfunction occurred, it would still be possible to determine and switch
to the one remaining good computer. Thus the PASS has a very high level
of hardware redundancy, although it is vulnerable to common-mode software
failures in Software A. To guard against this, a backup flight control system
(BFS) is included with a fifth computer and independent Software B. Clearly,
Hardware E also supplies additional computer redundancy. In addition to the
components described, there are many replicated sensors, actuators, controls,
data buses, and power supplies.

Figure 5.19 Hardware and software redundancy in the Space Shuttle’s avionics control
system.
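The voting-with-lockout arrangement described for the PASS can be illustrated with a short Python sketch. This is not the Shuttle's actual logic: the function name vote, the dictionary of outputs, and the rule for handling ties are assumptions made only for illustration. The lockout decision, which the text says is made by the crew after confirmation, is simply passed in as a set of computer names.

```python
from collections import Counter

def vote(outputs, locked_out=frozenset()):
    """Majority vote over the computers that are not locked out.

    outputs:    dict mapping a computer name to its output value (hypothetical format)
    locked_out: names of computers already disconnected from the voter
    Returns (winning value, set of active computers that disagreed with it).
    """
    active = {name: val for name, val in outputs.items() if name not in locked_out}
    if not active:
        raise RuntimeError("all computers are locked out")
    ranked = Counter(active.values()).most_common()
    # Assumed rule: a tie for first place means the voter cannot decide.
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        raise RuntimeError("no majority among active computers")
    winner = ranked[0][0]
    suspects = {name for name, val in active.items() if val != winner}
    return winner, suspects

# Four computers voting; D disagrees and becomes a candidate for lockout.
outputs = {"A": 10, "B": 10, "C": 10, "D": 99}
value, suspects = vote(outputs)
# After D is locked out, the remaining three behave like a TMR system.
value_after_lockout, _ = vote(outputs, locked_out={"D"})
```

The sketch separates detection (the suspects set) from lockout (the locked_out argument), which mirrors the text's point that a computer is disconnected only after the failure has been confirmed by other means.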

The computer self-test features detect 96% of the faults that could occur. 
Some of the built-in test and self-test features include the following: 

• Bus time-out tests: If the computer does not perform a periodic operation 
on the bus, and the timer has expired, the computer is labeled as failed. 

• Comparisons: A checksum is computed, and the computer is labeled as
failed if there are two successive miscompares.

• Watchdog timers: Processors set a timer, and if the timer completes its 
count before it is reset, the computer is labeled as failed and is locked 
out. 
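The watchdog-timer test in the list above can be sketched in a few lines of Python. This is a software model under stated assumptions, not the Shuttle's implementation: the class name WatchdogTimer, its methods, and the use of a caller-supplied clock value now are all hypothetical choices made so that the example is deterministic.

```python
class WatchdogTimer:
    """Model of a watchdog: the processor must reset it before the count completes."""

    def __init__(self, timeout, now=0.0):
        self.timeout = timeout          # allowed interval between resets
        self.deadline = now + timeout   # time at which the count completes
        self.failed = False             # latched once the deadline is missed

    def check(self, now):
        """Called by the monitor; latches a failure if the deadline has passed."""
        if now >= self.deadline:
            self.failed = True
        return self.failed

    def reset(self, now):
        """Called by a healthy processor; has no effect once the failure is latched."""
        if self.check(now):
            return
        self.deadline = now + self.timeout
```

Latching the failure reflects the text's statement that the computer is locked out: a late reset does not bring a failed computer back into service.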


To provide as much independence as possible, the two versions of the
software were developed by different organizations. The programs were both
written in the HAL/S language developed by Intermetrics. The primary system
was written by IBM Federal Systems Division, and the backup software
was written by Rockwell and Draper Labs. Both Software A and Software B 
perform all the critical functions, such as ascent to orbit, descent from orbit, 
and reentry, but Software A also includes various noncritical functions, such 
as data logging, that are not included in the backup software. 

In addition to the redundant features of Software A and B, great emphasis
has been placed on the life-cycle management of the Space Shuttle software.
Although the software for each mission is unique, many of its components 
are reused from previous missions. Thus, if an error is found in the software 
for flight number 76, all previous mission software (all of which is stored) 
containing the same code is repaired and retested. Also, the reason why such 
an error occurred is analyzed, and any possibilities for similar mechanisms to 
cause errors in the rest of the code for this mission and previous missions are 
investigated. This great care, along with other features, resulted in the Space 
Shuttle software team being one of the first organizations to earn the highest 
rating of “level 5” when it was examined by the Software Engineering Institute 
of Carnegie Mellon University and judged with respect to the capability maturity
model (CMM) levels. The reduction in error rate for the first 11 flights
indicates the progress made and is shown in Fig. 5.20. An early reliability 
study of ground-based Space Shuttle software appears in Shooman [1984]; the 
model predicted the observed software error rate on flight number 1. 

The more advanced voting techniques discussed in Section 4.11 also apply
to N-version software. For a comprehensive discussion of voting techniques,
see McAllister and Vouk [1996].


5.10 ROLLBACK AND RECOVERY 
5.10.1 Introduction 

The term recovery technique includes a class of approaches that attempts to 
detect a software error and, in various ways, retry the computation. Suppose, for 
example, that the track of an aircraft on the display in an air traffic control system 
becomes corrupted. If the previous points on the path and the current input data 
are stored, then the computation of the corrupted points can be retried based on 
the stored values of the current input data. Assuming that no critical situation is in 
progress (e.g., a potential air collision), the slight delay in recomputing and filling 
in these points causes no harm. At the very worst, these few points may be lost, 
but the software replaces them by a projected flight path based on the past path 
data, and soon new actual points are available. This is also a highly acceptable 
solution. The worst outcomes that must be strenuously avoided are from those 
cases in which the errors terminate the track or cause the entire display to crash. 
Some designers would call such recovery techniques rollback because the computation
backs up to the last set of previous valid data and attempts to reestablish
computations in the problem interval and resume computations from there
on. Another example that fits into this category is the familiar case in which one
uses a personal computer with a word processing program. Suppose one issues a
print command and discovers that the printer is turned off or the printer cable is
disconnected. Most (but not all) modern software will give an error message and
return control to the user, whereas some older programs lock the keyboard and
will not recover once the cable is connected or the printer is turned on. The only
recourse is to reboot the computer or to power down and then up again. Sometimes,
though, the last lines of code since the last manual or autosave operation
are lost in either process.

Figure 5.20 Errors found in the Space Shuttle’s software for the first 11 flights. The
IBM Federal Systems Division (now United Space Alliance) wrote and maintained
the onboard Space Shuttle control software, twice receiving the George M. Low Trophy,
NASA’s excellence award for quality and productivity. This graph was part of the
displays at various trade shows celebrating the awards. See Keller [1991] and
Schneidewind [1992] for more details.

All of these techniques attempt to detect a software error and, in various ways,
retry the computation. The basic assumption is that the problem is not a hard error
but a transient error. A transient software error is one due to a software fault that
results in a system error only for particular system states. Thus, if we repeat the
computation and the system state has changed, there is a good probability
that the error will not be repeated on the second trial.

Recovery techniques are generally classified as forward or backward error- 
recovery techniques. The general philosophy of forward error recovery is to 
continue operation while knowing that there is an error in computation and 
correct for this error a little later. Techniques such as this work only in certain 
circumstances; for example, in the case of a tracking algorithm for an air traffic 
control system. In the case of backward error recovery, we wish to restart or 
roll back the computation process to some point before the occurrence of the 
error and restart the computation. In this section, we discuss four types of 
backward error recovery: 

1. Reboot/restart techniques 

2. Journaling techniques 

3. Retry techniques 

4. Checkpoint techniques 

For a more complete discussion of the topics introduced in this section, see
Siewiorek [1982] and Section 3.10.
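Forward error recovery in the air traffic control example, where a corrupted point is replaced by a projection from past path data, can be illustrated with a minimal Python sketch. The function names, the (x, y) tuple format, the validity test passed in by the caller, and the assumption of equally spaced samples with straight-line motion are all hypothetical simplifications; a real tracker would use a more elaborate model.

```python
def project_next_point(track):
    """Linearly extrapolate the next (x, y) point from the last two valid points."""
    if len(track) < 2:
        raise ValueError("need at least two valid points to project")
    (x0, y0), (x1, y1) = track[-2], track[-1]
    return (2 * x1 - x0, 2 * y1 - y0)

def update_track(track, new_point, is_valid):
    """Append new_point if it passes the check; otherwise continue with a projection."""
    if is_valid(new_point):
        track.append(new_point)
    else:
        # Forward recovery: keep operating and substitute an estimate for the bad point.
        track.append(project_next_point(track))
    return track
```

The track is never interrupted: a bad sample costs only the accuracy of one point, which is the behavior the text describes as highly acceptable.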

5.10.2 Rebooting 

The simplest—but weakest—recovery technique from the implementation 
standpoint is to reboot or restart the system. The process of rebooting is well 
known to users of PCs who, without thinking too much about it, employ it one 
or more times a week to recover from errors. Actually, this raises a philosophical
point: Is it better to have software that is well debugged and has very few
errors that occur infrequently, or is having software with more residual errors
that can be cleared by frequent rebooting also acceptable? The author remembers
having a conversation with Ed Yourdon about an old computer when he
was preparing a paper on reliability measurements [Yourdon, 1972]. Yourdon
stated that a lot of computer crashes during operation were not recorded for
the Burroughs B5500 computer (popular during the mid-1960s) because it was
easy to reboot; the operator merely pushed the HALT button to stop the system
and pushed the LOAD button to load a fresh version of the operating
system. Furthermore, Yourdon stated, “The restart procedure requires two to
five minutes. This can be contrasted with most IBM System/360s, where a
restart usually required fifteen to thirty minutes.” As a means of comparison,
the author collected some data on reboot times that appears in Table 5.4.

TABLE 5.4 Typical Computer Reboot Times

Computer                      Operating System        Reboot Time
IBM System/360 (a)            “OS-360”                15-30 min
Burroughs 5500 (a)            “Burroughs OS”          2-5 min
Digital PC 360/20             Windows 3.1             41 sec
IBM Compatible Pentium ’90    Windows ’95             54 sec
IBM Notebook Celeron 300      Windows ’98 + Office    80 sec

(a) From Yourdon [1972].

It would seem that a restarting time of under one minute is now considered
acceptable for a PC. It is more difficult to quantify the amount of information
that is lost when a crash occurs and a reboot is required. We consider three
typical applications: (a) word processing, (b) reading and writing e-mail, and
(c) a Web search. We assume that word processing is being done on a PC and
that applications (b) and (c) are being conducted from home via modem connections
and a high-speed line to a server at work (a more demanding situation
than connection from a PC to a server via a local area network where all three
facilities are in a work environment). As stated before, the loss during word
processing due to a “lockup and reboot” depends on the text lost since the
last manual or autosave operation. In addition, there is the lost time to reload
the word processing software. These losses become significant when the crash
frequency becomes greater than, say, one or two per month. Choosing small
intervals between autosaves, keeping backup documents, and frequently printing
out drafts of new additions to a long document are really necessities. A
friend of the author’s who was president of a company that wrote and published
technical documents for clients had a disastrous fire that destroyed all of
his computer hardware, paper databases, and computer databases. Fortunately,
he had about 70% of the material stored on tape and disks in another location
that was unaffected, but it still took almost a year to restore his business to full
operation.

The process of reading and writing e-mail is even more involved.
A crash often severs the communication connection between the PC and the
server, which must then be reestablished. Also, the e-mail program must be
reentered. If a write operation was in progress, many e-mail programs do not
save the text already entered. A Web search that locks up may require only
the reissuing of the search, or it may require reacquisition of the server providing
the connection. Different programs provide a wide variety of behaviors
in response to such crashes. Not only is time lost, but any products that were
being read, saved, or printed during the crash are lost as well.


5.10.3 Recovery Techniques 

A reboot operation is similar to recovery. However, reboot generally involves 
the action of a human operator who observes that something is wrong with 
the system and attempts to correct the problem. If this attempt is unsuccessful, 
the operator issues a manual reboot command. The term recovery generally 
means that the system itself senses operational problems and issues a reboot
command. In some cases, the software problem is more severe and a simple
reboot is insufficient. Recovery may involve the reloading of some or all of
the operating system. If this is necessary on a PC, the BIOS stored in ROM
provides a basic means of communication to enable such a reloading. The most
serious problems could necessitate a lower-level fix of the disk that stores the
operating system. If we wish such a process to be autonomous, a special software
program must be included that performs these operations in response to
an “initiate recovery command.” Some of the clearest examples of such recovery
techniques are associated with robotic space-research vehicles.

Consider a robotic deep-space mission that loses control and begins to spin 
or tumble in space. The solar cells lose generating capacity, and the antennae no 
longer point toward Earth. The system must be designed from the start to recover 
from such a situation, as battery power provides a limited amount of time for 
such recovery to take place. Once the spacecraft is stabilized, the solar cells must 
be realigned with the Sun and the antennae must be realigned with Earth. This 
is generally provided by a small, highly secure kernel in the operating system 
that takes over in such a situation. In addition to hardware redundancy for all 
critical equipment, the software is generally subjected to a proof-of-correctness 
and an unusually high level of testing to ensure that it will perform its intended 
task. Many of NASA’s spacecraft have recovered from such situations, but some 
have not. The main point of this discussion is that reboot or recovery for all these 
examples must be contained in the requirements and planned for during the entire 
design, not added later in the process as almost an afterthought. 

5.10.4 Journaling Techniques 

Journaling techniques are slightly more complex and somewhat better than 
reboot or restart techniques. Such techniques are also somewhat quicker to 
employ than reboot or restart techniques since only a subset of the inputs must 
be saved. To employ these techniques requires that

1. a copy of the original database, disk, and filename be stored,

2. all transactions (inputs) that affect the data be stored during execution,
and

3. the process be backed up to the beginning and the computation be retried.

Clearly, items (2) and (3) require a lot of storage; in practice, journaling
can only be executed for a given time period, after which the inputs and the
process must be erased and a new journaling time period created. The choice
of the time period between journaling refreshes is an important design parameter.
Storage of inputs and processes is continuous during operation regardless
of the time period. The commands to refresh the journaling process should
not absorb too much of the operating time budget for the system. The main
trade-off is between two opposing effects: the amount of storage and the amount
of processing time for computational retry, both of which increase with the length
of the journaling period, versus the impact of system overhead for journaling,
which decreases as the interval between journaling refreshes increases. It is
possible that the storage requirements dominate and the optimum solution is to
refresh when storage is filled up.

These techniques of journaling are illustrated by an example. The Xerox 
Alto personal computer used an editor called Bravo. Journaling was used to 
recover if a computer crash occurred during an editing session. Most modern 
PC-based word processing systems use a different technique to avoid loss of 
data during a session. A timer is set, and every few minutes the data in the input 
buffer (representing new input data since the last manual or automatic save 
operation) is stored. The addition of journaling to the periodic storage process 
would ensure no data loss. (Perhaps the keystrokes that occurred immediately 
preceding a crash would be lost, but this at most would constitute the last word 
or the last command.) 
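The three journaling requirements listed earlier in this section (a stored copy of the original data, a log of every transaction, and a replay from the beginning) can be made concrete with a small Python sketch. It is an illustration under assumptions and not a description of how Bravo worked: the class name JournaledStore, the use of a dictionary as the "database," and the (key, value) form of a transaction are all hypothetical.

```python
import copy

class JournaledStore:
    """Toy key-value store that can be rebuilt from a snapshot plus a journal."""

    def __init__(self, initial):
        self._snapshot = copy.deepcopy(initial)  # item 1: copy of the original data
        self._journal = []                       # item 2: transactions since the snapshot
        self.data = copy.deepcopy(initial)       # working copy that a crash may corrupt

    def apply(self, key, value):
        """Record the transaction in the journal first, then update the working copy."""
        self._journal.append((key, value))
        self.data[key] = value

    def recover(self):
        """Item 3: back up to the snapshot and replay every journaled transaction."""
        self.data = copy.deepcopy(self._snapshot)
        for key, value in self._journal:
            self.data[key] = value

    def refresh(self):
        """Begin a new journaling period: the current data becomes the snapshot."""
        self._snapshot = copy.deepcopy(self.data)
        self._journal.clear()
```

How often refresh is called is the design parameter discussed above: calling it rarely lets the journal and the replay time grow, while calling it often adds snapshot overhead.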


5.10.5 Retry Techniques 

Retry techniques are quicker than those discussed previously, but they are more 
complex since more redundant process-state information must be stored. Retry 
is begun immediately after the error is detected. In the case of transient errors, 
one waits for the transient to die out and then initiates retry, whereas in the 
case of hard errors, the approach is to reconfigure the system. In either case, the 
operation affected by the error is then retried, which requires a complete knowledge
of the system state (kept in storage) before the operation was attempted.
If the interrupted operation or the error has irrevocably modified some data, 
the retry fails. Several examples of retry operation are as follows: 

1. Disk controllers generally use disk-read retry to minimize the number
of disk-read errors. Consider the case of an MS-DOS personal computer
system executing a disk-read command when an error is encountered. 
The disk-read operation is terminated, and the operator is asked whether 
he or she wishes to retry or abort. If the retry command is issued and 
the transient error has cleared, recovery is successful. However, if there 
is a hard error (e.g., a damaged floppy), retry will not clear the problem, 
and other processes must be employed. 

2. The Univac 1100/60 computer provided retry for macroinstructions after 
a failure. 

3. The IBM System/360 provided extensive retry capabilities, performing 
retries for both CPU and I/O operations. 
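The retry principle described in this section, in which the state before the operation is kept in storage and the operation is reattempted after an error, can be sketched in Python as follows. The exception class TransientError, the function retry, the dictionary used as "system state," and the limit of three attempts are all assumptions for illustration; the waiting period for the transient to die out is omitted for brevity.

```python
import copy

class TransientError(Exception):
    """Hypothetical exception type for errors that are expected to clear by themselves."""

def retry(operation, state, max_attempts=3):
    """Run operation(state); after a transient error, restore the state and try again."""
    saved = copy.deepcopy(state)  # complete state before the operation is attempted
    for _ in range(max_attempts):
        try:
            return operation(state)
        except TransientError:
            # Undo any partial changes so that the next attempt starts from the saved state.
            state.clear()
            state.update(copy.deepcopy(saved))
    # The error did not clear: treat it as a hard error and let the caller reconfigure.
    raise RuntimeError("retry failed; error appears to be hard")
```

Note that the sketch can restore only what lives in the state dictionary. If the failed operation has changed something outside it, the retry fails in the sense described in the text.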

Sometimes, the cause of errors is more complex and the retry may not work. 
Consider the following example that puzzled and plagued the author for a few 
months. A personal computer with a bad hard-disk sector worked fine with all 
programs except with a particular word processor. During ordinary save operations,
the operating system must have avoided the bad sector in storing disk
files. However, the word processor automatically saved the workspace every
few minutes. Small text segments in the workspace were fine, but medium-sized
text segments were sometimes subjected to disk-read errors during the
autosave operation but not during a normal (manually issued) save command.
In response to the error message “abort or retry,” a simple retry response generally
worked the first time or, at worst, required an abort followed by a save
command. With large text segments in the workspace, real trouble occurred:
When a disk-read error was encountered during automatic saving, one or more 
paragraphs of text from previous word processing sessions that were stored in 
the buffer were often randomly inserted into the present workspace, thereby 
corrupting the document. This is a graphic example of a retry failure. The 
author was about to attempt to lock out the bad disk sectors so they would 
not be used; however, the problem disappeared with the arrival of the second 
release of the word processor. Most likely, the new software used a slightly 
different buffer autosave mechanism. 


5.10.6 Checkpointing 

One advantage of checkpoint techniques is that they can generally be implemented
using only software, as contrasted with retry techniques that may
require additional dedicated hardware in addition to the necessary software 
routines. Also in the case of retry, the entire time history of the system state 
during the relevant period is saved, whereas in checkpointing the time history 
of the system state is saved only at specific points (checkpoints); thus less 
storage is required. A major disadvantage of checkpointing is the amount and 
difficulty of the programming that is required to employ checkpoints. The steps 
in the checkpointing process are as follows: 

1. After the error is detected, recovery is initiated as soon as transient errors 
die out or, in the case of hard errors, the system is reconfigured. 

2. The system is rolled back to the most recent checkpoint, and the system 
state is set to the stored checkpoint state and the process is restarted. If the 
operation is successfully restored, the process continues, and only some
time and any new input data during the recovery process are lost. If operation
is not restored, rollback to an earlier checkpoint can be attempted.

3. If the interrupted operation or the error has irrevocably modified some 
data, the checkpoint technique fails. 
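The rollback steps above can be illustrated with a minimal Python sketch. It is an in-memory model under assumptions: the class name CheckpointedProcess, the dictionary used as system state, and the steps_back parameter are hypothetical, and a practical system would write checkpoints to stable storage rather than keep them in a list.

```python
import copy

class CheckpointedProcess:
    """Toy process whose state is saved only at explicit checkpoints."""

    def __init__(self, state):
        self.state = state
        self._checkpoints = []   # saved states, oldest first

    def checkpoint(self):
        """Save a copy of the current state at this point in the computation."""
        self._checkpoints.append(copy.deepcopy(self.state))

    def rollback(self, steps_back=0):
        """Restore the most recent checkpoint, or an earlier one if steps_back > 0."""
        index = len(self._checkpoints) - 1 - steps_back
        if index < 0:
            raise RuntimeError("no checkpoint that far back")
        # Checkpoints taken after the restored one are abandoned.
        del self._checkpoints[index + 1:]
        self.state = copy.deepcopy(self._checkpoints[index])
```

Compared with the retry sketch of keeping the full state before every operation, only the states at the chosen checkpoints are stored, which is the storage advantage the text attributes to checkpointing.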

One better-developed example of checkpointing is within the Guardian operating
system used for the Tandem computer system. The system consists of a
primary process that does all the work and a backup process that operates on
the same inputs and is ready to take over if the primary process fails. At critical
points, the primary process sends checkpoint messages to the backup process.
For further details on the Guardian operating system, the reader is referred to
Siewiorek [1992, pp. 635-648]. Also, see the discussion in Section 3.10.

Some comments are necessary with respect to the way customers generally 
use Tandem computer systems and the Guardian operating system: 

1. The initial interest in the Tandem computer system was probably due to 
the marketing value of the term “NonStop architecture” that was used 
to describe the system. Although proprietary studies probably exist, the 
author does not know of any reliability or availability studies in the open 
literature that compared the Tandem architecture with a competitive system
such as a Digital Equipment VAX Cluster or an IBM system configured
for high reliability. Thus it is not clear how these systems compared
to the competition, although most users are happy. 

2. Once the system was studied by potential customers, one of the most 
important selling points was its modular structure. If the capacity of an 
existing Tandem system was soon to be exceeded, the user could simply 
buy additional Tandem machines, connect them in parallel, and easily 
integrate the expanded capacity with the existing system, which sometimes
could be accomplished without shutting down system operation.
This was a clear advantage over competitors, so it was built into the 
basic design. 

3. The use of the Guardian operating system’s checkpointing features could 
easily be turned on or off in configuring the system. Many users turned 
this feature off because it slowed down the system somewhat, but more 
importantly because to use it required some complex system programming
to be added to the application programs. Newer Tandem systems
have made such programming easier to use, as discussed in Section 
3.10.1. 


5.10.7 Distributed Storage and Processing 

Many modern computer systems have a client-server architecture—typically, 
PCs or workstations are the clients, and the server is a more powerful processor
with large disk storage attached. The clients and server are generally
connected by local area networks (LANs). In fact, processing and data storage
both tend to be decentralized, and several servers with their sets of clients are
often connected by another network. In such systems, there is considerable theoretical
and practical interest in devising algorithms to synchronize the various
servers and to prevent two or more users from colliding when they attempt to
access data from the same file. Even more important is the prevention of system
lockup when one user is writing to a device and another user tries to read
the device. For more information, the reader is referred to Bhargava [1987]
and to the literature.
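As a very small illustration of the collision and lockup concerns just mentioned, the following Python sketch serializes access to a named file and gives up after a time-out instead of waiting forever. It is a single-process model only: the class FileLockTable, its access method, and the one-second default time-out are hypothetical, and a truly distributed system needs the more elaborate algorithms covered in the references.

```python
import threading
from contextlib import contextmanager

class FileLockTable:
    """Per-file locks with a time-out so a blocked user does not hang indefinitely."""

    def __init__(self):
        self._locks = {}
        self._guard = threading.Lock()   # protects the table of locks itself

    def _lock_for(self, filename):
        with self._guard:
            return self._locks.setdefault(filename, threading.Lock())

    @contextmanager
    def access(self, filename, timeout=1.0):
        """Hold exclusive access to filename inside a with-block."""
        lock = self._lock_for(filename)
        if not lock.acquire(timeout=timeout):
            # Report the conflict to the caller rather than locking up the system.
            raise TimeoutError(f"{filename} is busy")
        try:
            yield
        finally:
            lock.release()
```

A reader that finds the file busy receives a TimeoutError and can retry later, which is one simple way to avoid the lockup described above.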
PROBLEMS

5.1. Consider a software project with which you are familiar (past,
in-progress, or planned). Write a few sentences or a paragraph describing
the phases given in Table 5.1 for this project. Make sure you start by 
describing the project in succinct form. 

5.2. Draw an H-diagram similar to that shown in Fig. 5.1 for the software 
of problem 5.1. 

5.3. How well does the diagram of problem 5.2 agree with Eqs. (5.1a-d)?
Explain. 

5.4. Write a short version of a test plan for the project of problem 5.1. Include 
the number and types of tests for the various phases. (Note: A complete 
test plan will include test data and expected answers.) 

5.5. Would (or did) the development follow the approach of Figs. 5.2, 5.3, 
or 5.4? Explain. 

5.6. We wish to develop software for a server on the Internet that keeps a 
database of locations for new cars that an auto manufacturer is tracking. 
Assume that as soon as a car is assembled, a reusable electronic box is 
installed in the vehicle that remains there until the car is delivered to a 
purchaser. The box contains a global positioning system (GPS) receiver 
that determines accurate location coordinates from the GPS satellites and 
a transponder that transmits a serial number and these coordinates via 
another satellite to the server. The server receives these transponder signals
and stores them in a file. The server has a geographical database
so that it can tell from the coordinates if each car is (a) in the manufacturer’s
storage lot, (b) in transit, or (c) in a dealer’s showroom or lot.
The database is accessed by an Internet-capable cellular phone or any 
computer with Internet access [Stork, 2000, p. 18]. 

(a) How would you design the server software for this system? (Figs.
5.2, 5.3, or 5.4?)

(b) Draw an H-diagram for the software. 

5.7. Repeat problem 5.3 for the software in problem 5.6. 

5.8. Repeat problem 5.4 for the software in problem 5.6.

5.9. Repeat problem 5.5 for the software in problem 5.6.

5.10. A component with a constant-failure rate of 4 × 10^-5 is discussed in
Section 5.4.5.

(a) Plot the failure rate as a function of time. 

(b) Plot the density function as a function of time. 

(c) Plot the cumulative distribution function as a function of time. 

(d) Plot the reliability as a function of time. 

5.11. It is estimated that about 100 errors will be removed from a program during
the integration test phase, which is scheduled for 12 months duration.

(a) Plot the error-removal curve assuming that the errors will follow a 
constant-removal rate. 

(b) Plot the error-removal curve assuming that the errors will follow a 
linearly decreasing removal rate. 

(c) Plot the error-removal curve assuming that the errors will follow an 
exponentially decreasing removal rate. 

5.12. Assume that a reliability model is to be fitted to problem 5.11. The number
of errors remaining in the program at the beginning of integration
testing is estimated to be 120. From experience with similar programs, 
analysts believe that the program will start integration testing with an 
MTTF of 150 hours. 

(a) Assuming a constant error-removal rate during integration, formulate 
a software reliability model. 

(b) Plot the reliability function versus time at the beginning of integration
testing and after 4, 8, and 12 months of debugging.

(c) Plot the MTTF as a function of the integration test time, τ.

5.13. Repeat problem 5.12 for a linearly decreasing error-removal rate. 

5.14. Repeat problem 5.12 for an exponentially decreasing error-removal rate. 

5.15. Compare the reliability functions derived in problems 5.12, 5.13, and
5.14 by plotting them on the same time axis for τ = 0, τ = 4, τ = 8, and
τ = 12 months.

5.16. Compare the MTTF functions derived in problems 5.12, 5.13, and 5.14
by plotting them on the same time axis versus τ.

5.17. After 1 month of integration testing of a program, the MTTF = 10 hours, 
and 15 errors have been removed. After 2 months, the MTTF = 15 hours, 
and 25 total errors have been removed. 

(a) Assuming a constant error-removal rate, fit a model to this data 
set. Estimate the parameters by using moment-estimation techniques 
[Eqs. (5.47a, b)]. 

(b) Sketch MTTF versus development time τ.

(c) How much integration test time will be required to achieve a 100-hour
MTTF? How many errors will have been removed by this time
and how many will remain?

5.18. Repeat problem 5.17 assuming a linearly decreasing error-rate model 
and using Eqs. (5.49a, b). 

5.19. Repeat problem 5.17 assuming an exponentially decreasing error-rate 
model and using Eqs. (5.51) and (5.52). 

5.20. After 1 month of integration testing, 20 errors have been removed, the 
MTTF of the software is measured by testing it with the use of simulated 
operational data, and the MTTF = 10 hours. After 2 months, the MTTF 
= 20 hours, and 50 total errors have been removed. 

(a) Assuming a constant error-removal rate, fit a model to this data 
set. Estimate the parameters by using moment-estimation techniques 
[Eqs. (5.47a, b)]. 

(b) Sketch the MTTF versus development time τ.

(c) How much integration test time will be required to achieve a 60-hour 
MTTF? How many errors will have been removed by this time and 
how many will remain? 

(d) If we release the software when it achieves a 60-hour MTTF, sketch 
the reliability function versus time. 

(e) How long can the software operate, if released as in part (d) above, 
before the reliability drops to 0.90? 

5.21. Repeat problem 5.20 assuming a linearly decreasing error-rate model 
and using Eqs. (5.49a, b). 

5.22. Repeat problem 5.20 assuming an exponentially decreasing error-rate 
model and using Eqs. (5.51) and (5.52). 

5.23. Assume that the company developing the software discussed in problem 
5.17 has historical data for similar systems that show an average MTTF 
of 50 hours with a variance σ² of 30 hours. The variance of the reliability
modeling is assumed to be 20 hours. Using Eqs. (5.55) and (5.56a, b), 
compute the reliability function. 

5.24. Assume that the model of Fig. 5.18 holds for three independent versions
of reliable software. The probability of error for 10,000 hours of
operation of each version is 0.01. Compute the reliability of the TMR
configuration assuming that there are no common-mode failures. Recompute
the reliability of the TMR configuration if 1% of the errors are due
to common-mode requirement errors and 1% are due to common-mode
software faults.




NETWORKED SYSTEMS 
RELIABILITY 


6.1 INTRODUCTION 

Many physical problems (e.g., computer networks, piping systems, and power 
grids) can be modeled by a network. In the context of this chapter, the word 
network means a physical problem that can be modeled as a mathematical 
graph composed of nodes and links (directed or undirected) where the branches 
have associated physical parameters such as flow per minute, bandwidth, or 
megawatts. In many such systems, the physical problem has sources and sinks 
or inputs and outputs, and the proper operation is based on connection between 
inputs and outputs. Systems such as computer or communication networks have 
many nodes representing the users or resources that desire to communicate and 
also have several links providing a number of interconnected pathways. These 
many interconnections make for high reliability and considerable complexity. 
Because many users are connected to such a network, a failure affects many 
people; thus the reliability goals must be set at a high level. 
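The idea that proper operation depends on a connection between inputs and outputs can be illustrated with a small Python sketch that models a network as a list of undirected links and tests whether a source can still reach a sink after some links fail. The function name is_connected, the tuple representation of a link, and the example node names are assumptions for illustration; this is not one of the analysis methods developed later in the chapter, only a picture of the underlying graph model.

```python
from collections import deque

def is_connected(links, source, sink, failed=frozenset()):
    """Breadth-first search from source to sink over the links that have not failed."""
    adjacency = {}
    for a, b in links:
        if (a, b) in failed or (b, a) in failed:
            continue                      # a failed link carries no traffic
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            return True
        for neighbor in adjacency.get(node, ()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return False

# Hypothetical three-node network with two routes from "a" to "c".
links = [("a", "b"), ("b", "c"), ("a", "c")]
still_up = is_connected(links, "a", "c", failed={("a", "c")})
```

With two routes between the end nodes, a single link failure does not disconnect them, which is the kind of redundancy that gives interconnected networks their high reliability.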

This chapter focuses on computer networks. It begins by discussing the several
techniques that allow one to analyze the reliability of a given network, after
which the more difficult problem of optimum network design is introduced. 
The chapter concludes with a brief introduction to one of the most difficult 
cases to analyze—where links can be disabled because of two factors: (a) link 
congestion (a situation in which flow demand exceeds flow capacity and a link 
is blocked or an excessive queue builds up at a node), and (b) failures from 
broken links. 

A new approach to reliability in interconnected networks is called survivability
analysis [Jia and Wing, 2001]. The concept is based on the design of
a network so it is robust in the face of abnormal events: the system must
survive and not crash. Recent research in this area is listed on Jeannette M.
Wing’s Web site [Wing, 2001].

The mathematical techniques used in this chapter are properties of mathematical
graphs, tie sets, and cut sets. A summary of the relevant concepts is
given in Section B2.7, and there is a brief discussion of some aspects of graph
theory in Section 5.3.5; other concepts will be developed in the body of the
chapter. The reader should be familiar with these concepts before continuing
with this chapter. For more details on graph theory, the reader is referred to
Shooman [1983, Appendix C]. There are of course other approaches to network
reliability; for these, the reader is referred to the following references:
Frank [1971], Van Slyke [1972, 1975], and Colbourn [1987, 1993, 1995]. It
should be mentioned that the cut-set and tie-set methods used in this chapter
apply to reliability analyses in general and are employed throughout reliability
engineering; they are essentially a theoretical generalization of the block
diagram methods discussed in Section B2. Another major approach is the
use of fault trees, introduced in Section B5 and covered in detail in Dugan
[1996].

In the development of network reliability and availability we will repeat for 
clarity some of the concepts that are developed in other chapters of this book, 
and we ask for the reader’s patience. 


6.2 GRAPH MODELS 

We focus our analytical techniques on the reliability of a communication
network, although such techniques also hold for other network models. Suppose
that the network is composed of computers and communication links. We
represent the system by a mathematical graph composed of nodes representing the
computers and edges representing the communications links. The terms used to
describe graphs are not unique; oftentimes, notations used in the mathematical
theory of graphs and those common in the application fields are
interchangeable. Thus a mathematics textbook may talk of vertices and arcs; an
electrical-engineering book, of nodes and branches; and a communications book,
of sites and interconnections or links. In general, these terms are synonymous
and used interchangeably.

In the most general model, both the nodes and the links can fail, but here
we will deal with a simplified model in which only the links can fail and the
nodes are considered perfect. In some situations, communication can go only
in one direction between a node pair; the link is represented by a directed edge
(an arrowhead is added to the edge), and one or more directed edges in a graph
result in a directed graph (digraph). If communication can occur in both
directions between two nodes, the edge is nondirected, and a graph without any
directed edges is an ordinary graph (i.e., nondirected, not a digraph). We will
consider both directed and nondirected graphs. (Sometimes, it is useful to view
a nondirected graph as a special case of a directed graph in which each link
is represented by two identical parallel links, with opposite link directions.)

Figure 6.1 A four-node graph representing a computer or communication network.

When we deal with nondirected graphs composed of E edges and N nodes,
the notation G(N, E) will be used. A particular node will be denoted as $n_i$
and a particular edge as $e_j$. We can also identify an edge by naming the
nodes that it connects; thus, if edge j is between nodes s and t, we may write
$e_j = (n_s, n_t) = e(s, t)$. One also can say that edge j is incident on nodes
s and t. As an example, consider the graph of Fig. 6.1, where G(N = 4, E = 6).
The nodes $n_1$, $n_2$, $n_3$, and $n_4$ are a, b, c, and d. Edge 1 is denoted
by $e_1 = e(n_1, n_2) = (a, b)$, edge 2 by $e_2 = e(n_2, n_3) = (b, c)$, and so
forth. The example of a network graph shown in Fig. 6.1 has four nodes
(a, b, c, d) and six edges (1, 2, 3, 4, 5, 6). The edges are undirected
(directed edges have arrowheads to show the direction), and since in this
particular example all possible edges between the four nodes are shown, it is
called a complete graph. The total number of edges in a complete graph with n
nodes is the number of combinations of n things taken two at a time,
$n!/[(2!)(n - 2)!]$. In the example of Fig. 6.1, the total number of edges is
$4!/[(2!)(4 - 2)!] = 6$.
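
The graph notation above can be sketched in a few lines of Python. This is only an illustration: the text fixes edge 1 = (a, b) and edge 2 = (b, c), while the endpoints assigned here to edges 3 through 6 are an assumption chosen to be consistent with the tie sets listed later in Table 6.2, and the names NODES, EDGES, and is_complete are hypothetical.

```python
from itertools import combinations
from math import comb

NODES = ["a", "b", "c", "d"]

# Edge number -> pair of end nodes. Edges 1 and 2 follow the text; the
# endpoints of edges 3-6 are an ASSUMPTION consistent with Table 6.2.
EDGES = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d"),
         4: ("a", "d"), 5: ("a", "c"), 6: ("b", "d")}

def is_complete(nodes, edges):
    """True when every unordered pair of nodes is joined by some edge."""
    present = {frozenset(pair) for pair in edges.values()}
    return all(frozenset(pair) in present for pair in combinations(nodes, 2))

# A complete graph on n nodes has n!/[(2!)(n - 2)!] edges.
print(len(EDGES), comb(len(NODES), 2), is_complete(NODES, EDGES))
```

Storing each edge as a numbered pair of nodes keeps the edge labels used in the tables of this chapter available for the later calculations.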

In formulating the network model, we will assume that each link is either 
good or bad and that there are no intermediate states. Also, independence of 
link failures is assumed, and no repair or replacement of failed links is
considered. In general, the links have a high reliability, and because of all the
multiple (redundant) paths, the network has a very high reliability. This large 
number of parallel paths makes for high complexity; the efficient calculation 
of network reliability is a major problem in the analysis, design, or synthesis 
of a computer communication network. 


6.3 DEFINITION OF NETWORK RELIABILITY

In general, the definition of reliability is the probability that the system
operates successfully for a given period of time under stated environmental
conditions (see Appendix B). We assume that the systems being modeled operate
continuously and that the time in question is the clock time since the last
failure or restart of the system. The environmental conditions include not only
temperature, atmosphere, and weather, but also system load or traffic. The term
successful operation can have many interpretations. The two primary ones are
related to how many of the n nodes can communicate with each other. We
assume that as time increases, a number of the m links fail. If we focus on
communication between a pair of nodes where s is the source node and t is
the target node, then successful operation is defined as the presence of one or
more operating paths between s and t. This is called the two-terminal problem,
and the probability of successful communication between s and t is called
two-terminal reliability. If successful operation is defined as all nodes being
able to communicate, we have the all-terminal problem, for which it can be
stated that node s must be able to communicate with all the other n - 1 nodes,
since communication between any one node s and all other nodes,
$t_1, t_2, \ldots, t_{n-1}$, is equivalent to communication between all nodes.
The probability of successful communication between node s and nodes
$t_1, \ldots, t_{n-1}$ is called the all-terminal reliability.

In more formal terms, we can state that the all-terminal reliability is the
probability that node $n_i$ can communicate with node $n_j$ for all pairs
$n_i, n_j$ (where $i \ne j$). We wish to show that this is equivalent to the
proposition that node s (take $s = n_1$) can communicate with all other nodes
$t_1 = n_2, t_2 = n_3, \ldots, t_{n-1} = n_n$. Choose any other node $n_x$
(where $x \ne 1$). By assumption, $n_x$ can communicate with s because s can
communicate with all nodes and communication is in both directions. However,
once $n_x$ reaches s, it can then reach all other nodes because s is connected
to all nodes. Thus all-terminal connectivity for x = 1 results in all-terminal
connectivity for $x \ne 1$, and the proposition is proved.

In general, reliability, R, is the probability of successful operation. In the
case of networks, we are interested in all-terminal reliability, $R_{all}$:

$$R_{all} = P(\text{that all } n \text{ nodes are connected}) \tag{6.1}$$

or the two-terminal reliability:

$$R_{st} = P(\text{that nodes } s \text{ and } t \text{ are connected}) \tag{6.2}$$

Similarly, k-terminal reliability is the probability that a subset of k nodes
(2 < k < n) are connected. Thus we must specify what type of reliability we are
discussing when we begin a problem.
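
The three definitions differ only in which nodes must remain connected, which a short sketch can make concrete. The code below is an illustration under stated assumptions, not a method from the text: the edge endpoints for Fig. 6.1 are assumed (consistent with Table 6.2), and reachable and k_terminal_ok are hypothetical helper names.

```python
# Assumed endpoints for the six edges of Fig. 6.1 (see Table 6.2).
EDGES = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d"),
         4: ("a", "d"), 5: ("a", "c"), 6: ("b", "d")}

def reachable(source, good_edges, edges=EDGES):
    """Set of nodes reachable from source using only the working edges."""
    seen, stack = {source}, [source]
    while stack:
        node = stack.pop()
        for e in good_edges:
            u, v = edges[e]
            if node in (u, v):
                other = v if node == u else u
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return seen

def k_terminal_ok(terminals, good_edges, edges=EDGES):
    """True if all listed terminal nodes lie in one connected piece.

    Two terminals give the two-terminal test; listing every node gives
    the all-terminal test (node s reaching all others is sufficient).
    """
    terminals = list(terminals)
    return set(terminals) <= reachable(terminals[0], good_edges, edges)

# With only edges 2 and 5 working, a reaches b through c, but d is cut off.
print(k_terminal_ok("ab", {2, 5}), k_terminal_ok("abcd", {2, 5}))
```

Checking reachability from a single node mirrors the argument above that communication between one node and all others is equivalent to communication between all pairs in a nondirected graph.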

We stated previously that repairs were not included in the analysis of
network reliability. This is not strictly true; for simplicity, no repair was
assumed. In actuality, when a node-switching computer or a telephone
communications line goes down, each is promptly repaired. The metric used to
describe a repairable system is availability, which is defined as the
probability that at any instant of time t, the system is up and available.
Remember that in the case of reliability, there were no failures in the interval
0 to t. The notation is A(t), and availability and reliability are related as
follows by the union of events:

$$A(t) = P(\text{no failure in interval 0 to } t + \text{1 failure and 1 repair in interval 0 to } t + \text{2 failures and 2 repairs in interval 0 to } t + \cdots) \tag{6.3}$$

The events in Eq. (6.3) are all mutually exclusive; thus Eq. (6.3) can be
expanded as a sum of probabilities:

$$A(t) = P(\text{no failure in interval 0 to } t) + P(\text{1 failure and 1 repair in interval 0 to } t) + P(\text{2 failures and 2 repairs in interval 0 to } t) + \cdots \tag{6.4}$$

Clearly,

• The first term in Eq. (6.4) is the reliability, R(t)

• A(t) = R(t) = 1 at t = 0

• For t > 0, A(t) > R(t)

• $R(t) \to 0$ as $t \to \infty$

• It is shown in Appendix B that $A(t) \to A_{ss}$ as $t \to \infty$ and, as
long as repair is present, $A_{ss} > 0$

Availability is generally derived using Markov probability models (see
Appendix B and Shooman [1990]). The result of availability derivations for
a single element with various failure and repair probability distributions can
become quite complex. In general, the derivations are simplified by assuming
exponential probability distributions for the failure and repair times
(equivalent to constant-failure rate, $\lambda$, and constant-repair rate,
$\mu$). Sometimes, the mean time to failure (MTTF) and the mean time to repair
(MTTR) are used to describe the repair process and availability. In many cases,
the terms mean time between failure (MTBF) and mean time between repair (MTBR)
are used instead of MTTF and MTTR. For constant-failure and -repair rates, the
mean times become MTBF = $1/\lambda$ and MTBR = $1/\mu$. The solution for A(t)
has an exponentially decaying transient term and a constant steady-state term.
After a few failure-repair cycles, the transient term dies out and the
availability can be represented by the simpler steady-state term. For the case
of constant-failure and -repair rates for a single item, the steady-state
availability is given by the equation that follows (see Appendix B).

$$A_{ss} = \mu/(\lambda + \mu) = \text{MTBF}/(\text{MTBF} + \text{MTBR}) \tag{6.5}$$

Since the MTBF >> MTBR in any well-designed system, $A_{ss}$ is close to
unity. Also, alternate definitions for MTTF and MTTR lead to slightly different
but equivalent forms for Eq. (6.5) (see Kershenbaum [1993]).
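
A brief numerical sketch of Eq. (6.5) follows. The steady-state function is a direct transcription of Eq. (6.5); the transient function uses the standard two-state Markov result for an element that is up at t = 0, which is assumed here rather than quoted from this chapter. The function names and the rates in the usage lines are hypothetical.

```python
import math

def steady_state_availability(failure_rate, repair_rate):
    """A_ss = mu / (lambda + mu), Eq. (6.5)."""
    return repair_rate / (failure_rate + repair_rate)

def availability(t, failure_rate, repair_rate):
    """A(t) for one element that is up at t = 0 (ASSUMED standard
    two-state Markov model): steady-state term plus decaying transient."""
    total = failure_rate + repair_rate
    transient = (failure_rate / total) * math.exp(-total * t)
    return repair_rate / total + transient

# Hypothetical rates: one failure per 10,000 h, one repair per 10 h.
lam, mu = 1e-4, 0.1
print(steady_state_availability(lam, mu))   # equals MTBF / (MTBF + MTBR)
print(availability(100.0, lam, mu))         # transient has nearly died out
```

Because the transient decays at rate $\lambda + \mu$, it is dominated by the (large) repair rate, which is why the steady-state value is adequate after a few failure-repair cycles.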

Another derivation of availability can be done in terms of system uptime,
U(t), and system downtime, D(t), resulting in the following different formula
for availability:

$$A_{ss} = U(t)/[U(t) + D(t)] \tag{6.6}$$

The formulation given in Eq. (6.6) is more convenient than that of Eq. (6.5)
if we wish to estimate $A_{ss}$ based on collected field data. In the case of a
computer network, the availability computations can become quite complex if the
repairs of the various elements are coupled, in which case a single repairman
might be responsible for maintaining, say, two nodes and five lines. If
several failures occur in a short period of time, a queue of failed items
waiting for repairs might build up and the downtime is lengthened, and the term
“repairman-coupled” is used. In the ideal case, if we assume that each element
in the system has its own dedicated repairman, we can guarantee that the
elements are decoupled and that the steady-state availabilities can be
substituted into probability expressions in the same way as reliabilities are.
In a practical case, we do not have individual repairmen, but if the repair rate
is much larger than the failure rate of the several components that the
repairman supports, then approximate decoupling is a good assumption. Thus, in
most network reliability analyses there will be no distinction made between
reliability and availability; the two terms are used interchangeably in the
network field in a loose sense. Thus a reliability analyst would make a
combinatorial model of a network and insert reliability values for the
components to calculate system reliability. Because decoupling holds, he or she
would substitute component availabilities in the same model and calculate the
system availability; however, a network analyst would perform the same
availability computation and refer to it colloquially as “system reliability.”
For a complete discussion of availability, see Shooman [1990].
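
Equation (6.6) lends itself to a simple estimate from field records. The sketch below assumes a hypothetical log format, a list of (state, hours) pairs, and a hypothetical function name; it is meant only to show the bookkeeping.

```python
def availability_from_log(intervals):
    """Estimate A_ss = U / (U + D), Eq. (6.6), from logged intervals.

    intervals is a list of (state, hours) pairs, where state is "up" or
    "down"; this record format is an assumption made for the example.
    """
    uptime = sum(hours for state, hours in intervals if state == "up")
    downtime = sum(hours for state, hours in intervals if state == "down")
    return uptime / (uptime + downtime)

# Hypothetical field data for one link: 990 h up, 10 h down in total.
log = [("up", 400.0), ("down", 4.0), ("up", 590.0), ("down", 6.0)]
print(availability_from_log(log))
```

Under the decoupling assumption discussed above, an estimate obtained this way for each element could be substituted for the element reliability in the network expressions that follow.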


6.4 TWO-TERMINAL RELIABILITY 

The evaluation of network reliability is a difficult problem, but there are
several approaches. For any practical problem of significant size, one must use
a computer program. Thus all the techniques we discuss that use a
“pencil-paper-and-calculator” analysis are preludes to understanding how to
write algorithms and programs for network reliability computation. Also, it is
always valuable to have an analytical solution of simpler problems for use in
testing reliability computation programs until the user becomes comfortable with
such a program. Since two-terminal reliability is a bit simpler than
all-terminal reliability, we will discuss it first and treat all-terminal
reliability in the following section.

6.4.1 State-Space Enumeration 

Conceptually, the simplest means of evaluating the two-terminal reliability of
a network is to enumerate all possible combinations where each of the e edges
can be good or bad, resulting in $2^e$ combinations. Each of these combinations
of good and bad edges can be treated as an event $E_i$. These events are all
mutually exclusive (disjoint), and the reliability expression is simply the
probability of the union of the subset of these events that contain a path
between s and t.

$$R_{st} = P(E_1 + E_2 + E_3 + \cdots) \tag{6.7}$$

Since these events are mutually exclusive, the probability of the union
becomes the sum of the individual event probabilities.

$$R_{st} = P(E_1) + P(E_2) + P(E_3) + \cdots \tag{6.8}$$

[Note that in Eq. (6.7) the symbol + stands for union ($\cup$), whereas in Eq.
(6.8), the + represents addition. Also throughout this chapter, the intersection
of x and y ($x \cap y$) is denoted by $x \cdot y$, or just xy.]

As an example, consider the graph of a complete four-node communication
network that is shown in Fig. 6.1. We are interested in the two-terminal
reliability for node pair a and b; thus s = a and t = b. Since there are six
edges, there are $2^6 = 64$ events associated with this graph, all of which are
presented in Table 6.1. The following definitions are used in constructing
Table 6.1:

$E_i$ = the event i
j = the success of edge j
j' = the failure of edge j

The term good means that there is at least one path from a to b for the given
combination of good and failed edges. The term bad, on the other hand, means
that there are no paths from a to b for the given combination of good and failed
edges. The result (good or bad) is determined by inspection of the graph.

Note that in constructing Table 6.1, the following observations prove
helpful: Any combination where edge 1 is good represents a connection, and at
least three edges must fail (edge 1 plus two others) for any event to be bad.

Substitution of the good events from Table 6.1 into Eq. (6.8) yields the
two-terminal reliability from a to b:

$$\begin{aligned} R_{ab} = {} & [P(E_1)] + [P(E_2) + \cdots + P(E_7)] + [P(E_8) + P(E_9) + \cdots + P(E_{22})] \\ & + [P(E_{23}) + P(E_{24}) + \cdots + P(E_{34}) + P(E_{37}) + \cdots + P(E_{42})] \\ & + [P(E_{43}) + P(E_{44}) + \cdots + P(E_{47}) + P(E_{50}) + P(E_{56})] + [P(E_{58})] \end{aligned} \tag{6.9}$$

The first bracket in Eq. (6.9) has one term where all the edges must be good,
and if all edges are identical and independent, and they have a probability of
success of p, then the probability of event $E_1$ is $p^6$. Similarly, for the
second bracket, there are six events of probability $qp^5$, where the
probability of failure q = 1 - p, etc. Substitution in Eq. (6.9) yields a
polynomial in p and q:


$$R_{ab} = p^6 + 6qp^5 + 15q^2p^4 + 18q^3p^3 + 7q^4p^2 + q^5p \tag{6.10}$$

TABLE 6.1 The Event-Space for the Graph of Fig. 6.1 (s = a, t = b)

No failures: $\binom{6}{0} = \frac{6!}{0!\,6!} = 1$

    E1  = 123456            Good

One failure: $\binom{6}{1} = \frac{6!}{1!\,5!} = 6$

    E2  = 1'23456           Good
    E3  = 12'3456           Good
    E4  = 123'456           Good
    E5  = 1234'56           Good
    E6  = 12345'6           Good
    E7  = 123456'           Good

Two failures: $\binom{6}{2} = \frac{6!}{2!\,4!} = 15$

    E8  = 1'2'3456          Good
    E9  = 1'23'456          Good
    E10 = 1'234'56          Good
    E11 = 1'2345'6          Good
    E12 = 1'23456'          Good
    E13 = 12'3'456          Good
    E14 = 12'34'56          Good
    E15 = 12'345'6          Good
    E16 = 12'3456'          Good
    E17 = 123'4'56          Good
    E18 = 123'45'6          Good
    E19 = 123'456'          Good
    E20 = 1234'5'6          Good
    E21 = 1234'56'          Good
    E22 = 12345'6'          Good

Three failures: $\binom{6}{3} = \frac{6!}{3!\,3!} = 20$

    E23 = 1234'5'6'         Good
    E24 = 123'45'6'         Good
    E25 = 123'4'56'         Good
    E26 = 123'4'5'6         Good
    E27 = 12'345'6'         Good
    E28 = 12'34'56'         Good
    E29 = 12'34'5'6         Good
    E30 = 12'3'456'         Good
    E31 = 12'3'45'6         Good
    E32 = 12'3'4'56         Good
    E33 = 1'2345'6'         Good
    E34 = 1'234'56'         Good
    E35 = 1'234'5'6         Bad
    E36 = 1'2'3456'         Bad
    E37 = 1'2'345'6         Good
    E38 = 1'2'34'56         Good
    E39 = 1'23'456'         Good
    E40 = 1'23'45'6         Good
    E41 = 1'23'4'56         Good
    E42 = 1'2'3'456         Good

Four failures: $\binom{6}{4} = \frac{6!}{4!\,2!} = 15$

    E43 = 123'4'5'6'        Good
    E44 = 12'34'5'6'        Good
    E45 = 12'3'45'6'        Good
    E46 = 12'3'4'56'        Good
    E47 = 12'3'4'5'6        Good
    E48 = 1'234'5'6'        Bad
    E49 = 1'23'45'6'        Bad
    E50 = 1'23'4'56'        Good
    E51 = 1'23'4'5'6        Bad
    E52 = 1'2'345'6'        Bad
    E53 = 1'2'34'56'        Bad
    E54 = 1'2'34'5'6        Bad
    E55 = 1'2'3'456'        Bad
    E56 = 1'2'3'45'6        Good
    E57 = 1'2'3'4'56        Bad

Five failures: $\binom{6}{5} = \frac{6!}{5!\,1!} = 6$

    E58 = 12'3'4'5'6'       Good
    E59 = 1'23'4'5'6'       Bad
    E60 = 1'2'34'5'6'       Bad
    E61 = 1'2'3'45'6'       Bad
    E62 = 1'2'3'4'56'       Bad
    E63 = 1'2'3'4'5'6       Bad

Six failures: $\binom{6}{6} = \frac{6!}{6!\,0!} = 1$

    E64 = 1'2'3'4'5'6'      Bad

Substitutions such as those in Eq. (6.10) are prone to algebraic mistakes; as
a necessary (but not sufficient) check, we evaluate the polynomial for p = 1
and q = 0, which should yield a reliability of unity. Similarly, evaluating the
polynomial for p = 0 and q = 1 should yield a reliability of 0. (Any network
has a reliability of unity regardless of its topology if all edges are perfect;
it has a reliability of 0 if all its edges have failed.)

Numerical evaluation of the polynomial for p = 0.9 and q = 0.1 yields

$$R_{ab} = 0.9^6 + 6(0.1)(0.9)^5 + 15(0.1)^2(0.9)^4 + 18(0.1)^3(0.9)^3 + 7(0.1)^4(0.9)^2 + (0.1)^5(0.9) \tag{6.11a}$$

$$R_{ab} = 0.5314 + 0.3543 + 0.0984 + 0.0131 + 5.67 \times 10^{-4} + 9 \times 10^{-6} \tag{6.11b}$$

$$R_{ab} = 0.997848 \tag{6.11c}$$

Usually, event-space-reliability calculations require much effort and time even
though the procedure is clear. The number of events builds up exponentially
as $2^e$. For e = 10, we have 1,024 terms, and if we double e, there are over
a million terms. Thus we seek easier methods.
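
The event-space procedure of this section is mechanical enough to sketch in code. The following is a minimal illustration, not a production algorithm: the edge endpoints for Fig. 6.1 are assumed (consistent with Table 6.2), and the names connected, good_state_counts, and reliability are hypothetical. It enumerates all $2^6$ states, counts the good states by number of failed edges, and evaluates the resulting polynomial, including the p = 1 and p = 0 checks.

```python
from itertools import product

# Assumed endpoints for the six edges of Fig. 6.1 (see Table 6.2).
EDGES = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d"),
         4: ("a", "d"), 5: ("a", "c"), 6: ("b", "d")}

def connected(s, t, good, edges=EDGES):
    """True if t can be reached from s using only the edges listed in good."""
    seen, stack = {s}, [s]
    while stack:
        node = stack.pop()
        for e in good:
            u, v = edges[e]
            if node in (u, v):
                other = v if node == u else u
                if other not in seen:
                    seen.add(other)
                    stack.append(other)
    return t in seen

def good_state_counts(s, t, edges=EDGES):
    """counts[k] = number of good events having exactly k failed edges."""
    ids = sorted(edges)
    counts = [0] * (len(ids) + 1)
    for state in product((True, False), repeat=len(ids)):
        good = [e for e, up in zip(ids, state) if up]
        if connected(s, t, good, edges):
            counts[len(ids) - len(good)] += 1
    return counts

def reliability(counts, p):
    """Evaluate sum_k counts[k] * q^k * p^(e-k) for identical edges."""
    e, q = len(counts) - 1, 1.0 - p
    return sum(c * q**k * p**(e - k) for k, c in enumerate(counts))

counts = good_state_counts("a", "b")
print(counts)                               # [1, 6, 15, 18, 7, 1, 0]
print(round(reliability(counts, 0.9), 6))   # 0.997848
print(reliability(counts, 1.0), reliability(counts, 0.0))
```

The counts reproduce the coefficients of Eq. (6.10), and the loop over `product` makes the $2^e$ growth in work explicit.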


6.4.2 Cut-Set and Tie-Set Methods 

One can reduce the amount of work in a network reliability analysis below the
$2^e$ complexity required for the event-space method if one focuses on the
minimal cut sets and minimal tie sets of the graph (see Appendix B and Shooman
[1990, Section 3.6.5]). The tie sets are the groups of edges that form a path
between s and t. The term minimal implies that no node or edge is traversed
more than once, but another way of defining this is that minimal tie sets have
no subsets of edges that are a tie set. If there are i tie sets between s and t,
then the reliability expression is given by the expansion of

$$R_{st} = P(T_1 + T_2 + \cdots + T_i) \tag{6.12}$$

Similarly, one can focus on the minimal cut sets of a graph. A cut set is a
group of edges that break all paths between s and t when they are removed
from the graph. If a cut set is minimal, no subset is also a cut set. The
reliability expression in terms of the j cut sets is given by the expansion of

$$R_{st} = 1 - P(C_1 + C_2 + \cdots + C_j) \tag{6.13}$$


We now apply the above theory to the example given in Fig. 6.1. The minimal
cut sets and tie sets are found by inspection for s = a and t = b and are
given in Table 6.2.
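
For graphs too large for inspection, the tie sets and cut sets can be searched for mechanically. The sketch below is one simple way to do so and is not one of the algorithms cited in this chapter: tie sets are found as simple paths by depth-first search, and cut sets are found by brute force as the smallest edge groups that intersect every tie set. The edge endpoints are assumed for Fig. 6.1, and the function names are hypothetical.

```python
from itertools import combinations

# Assumed endpoints for the six edges of Fig. 6.1.
EDGES = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d"),
         4: ("a", "d"), 5: ("a", "c"), 6: ("b", "d")}

def minimal_tie_sets(s, t, edges=EDGES):
    """Edge sets of all simple s-t paths (no node visited twice)."""
    paths = []

    def walk(node, visited, used):
        if node == t:
            paths.append(frozenset(used))
            return
        for e, (u, v) in edges.items():
            if node in (u, v):
                nxt = v if node == u else u
                if nxt not in visited:
                    walk(nxt, visited | {nxt}, used + [e])

    walk(s, {s}, [])
    return paths

def minimal_cut_sets(s, t, edges=EDGES):
    """Brute force: a group of edges is a cut set if it hits every tie set;
    candidates are tried smallest first and supersets of a cut are skipped."""
    ties = minimal_tie_sets(s, t, edges)
    cuts = []
    for size in range(1, len(edges) + 1):
        for cand in combinations(sorted(edges), size):
            cand = frozenset(cand)
            if any(c <= cand for c in cuts):
                continue            # contains a smaller cut: not minimal
            if all(cand & tie for tie in ties):
                cuts.append(cand)
    return cuts

print(sorted(sorted(t) for t in minimal_tie_sets("a", "b")))
print(sorted(sorted(c) for c in minimal_cut_sets("a", "b")))
```

The brute-force cut search is exponential in the number of edges and is shown only because it is easy to verify on a six-edge example.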

TABLE 6.2 Minimal Tie Sets and Cut Sets for the Example of Fig. 6.1
(s = a, t = b)

    Tie Sets         Cut Sets
    T1 = 1           C1 = 1'4'5'
    T2 = 52          C2 = 1'6'2'
    T3 = 46          C3 = 1'5'6'3'
    T4 = 234         C4 = 1'2'3'4'
    T5 = 536         -

Since there are fewer cut sets, it is easier to use Eq. (6.13) rather than Eq.
(6.12); however, there is no general rule for when j < i or vice versa.


$$R_{ab} = 1 - P(C_1 + C_2 + C_3 + C_4) \tag{6.14a}$$

$$R_{ab} = 1 - P(1'4'5' + 1'6'2' + 1'5'3'6' + 1'2'3'4') \tag{6.14b}$$

$$\begin{aligned} R_{ab} = 1 & - [P(1'4'5') + P(1'6'2') + P(1'5'3'6') + P(1'2'3'4')] \\ & + [P(1'2'4'5'6') + P(1'3'4'5'6') + P(1'2'3'4'5') + P(1'2'3'5'6') + P(1'2'3'4'6') + P(1'2'3'4'5'6')] \\ & - [P(1'2'3'4'5'6') + P(1'2'3'4'5'6') + P(1'2'3'4'5'6') + P(1'2'3'4'5'6')] \\ & + [P(1'2'3'4'5'6')] \end{aligned} \tag{6.14c}$$

The expansion of the probability of a union of events that occurs in Eq. (6.14)
is often called the inclusion-exclusion formula. [See Eq. (A11).]

Note that in the expansions in Eqs. (6.12) or (6.13), ample use is made of
the theorems $x \cdot x = x$ and $x + x = x$ (see Appendix A). For example, the
second bracket in Eq. (6.14c) has as its second term
$P(C_1 C_3) = P([1'4'5'][1'5'6'3']) = P(1'3'4'5'6')$, since $1' \cdot 1' = 1'$
and $5' \cdot 5' = 5'$. The reader should note that this point is often
overlooked (see Appendix D, Section D3), and it may or may not make a numerical
difference.

If all the edges have equal probabilities of failure q and are independent,
Eq. (6.14c) becomes

$$R_{ab} = 1 - [2q^3 + 2q^4] + [5q^5 + q^6] - [4q^6] + [q^6]$$

$$R_{ab} = 1 - 2q^3 - 2q^4 + 5q^5 - 2q^6 \tag{6.15}$$

The necessary checks, $R_{ab} = 1$ for q = 0 and $R_{ab} = 0$ for q = 1, are
valid.

For q = 0.1, Eq. (6.15) yields

$$R_{ab} = 1 - 2 \times 0.1^3 - 2 \times 0.1^4 + 5 \times 0.1^5 - 2 \times 0.1^6 = 0.997848 \tag{6.16}$$

Of course, the result of Eq. (6.16) is identical to Eq. (6.11c). If we substitute
tie sets into Eq. (6.12), we would get a different though equivalent expression.
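
The inclusion-exclusion expansion can be sketched compactly once each cut set or tie set is stored as a set of edge numbers, because the set union implements the theorem $x \cdot x = x$ automatically. The code below assumes identical, independent edges; the name union_probability is hypothetical, and the sets are those of Table 6.2.

```python
from itertools import combinations

TIE_SETS = [{1}, {5, 2}, {4, 6}, {2, 3, 4}, {5, 3, 6}]
CUT_SETS = [{1, 4, 5}, {1, 6, 2}, {1, 5, 6, 3}, {1, 2, 3, 4}]

def union_probability(event_sets, prob):
    """P(union of events) by inclusion-exclusion.

    Each event is "all edges in the set are in the same given state", and
    every edge is in that state independently with probability prob.
    """
    total = 0.0
    for k in range(1, len(event_sets) + 1):
        sign = 1.0 if k % 2 else -1.0
        for group in combinations(event_sets, k):
            merged = set().union(*group)     # repeated edges count once
            total += sign * prob ** len(merged)
    return total

q = 0.1
print(round(1.0 - union_probability(CUT_SETS, q), 6))    # Eq. (6.13): 0.997848
print(round(union_probability(TIE_SETS, 1.0 - q), 6))    # Eq. (6.12): 0.997848
```

Both routes give the same number, which illustrates the remark that the tie-set expansion is a different though equivalent expression; the nested loops visit $2^j - 1$ (or $2^i - 1$) groups.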

The expansion of Eq. (6.13) has a complexity of $2^j$ and is more complex
than Eq. (6.12) if there are more cut sets than tie sets. At this point, it would
seem that we should analyze the network and see how many tie sets and cut
sets exist between s and t, and assuming that i and j are manageable numbers
(as is the case in the example to follow), then either Eq. (6.12) or Eq. (6.13)
is feasible. In a very large problem (assume i < j < e), even $2^i$ is too large
to deal with, and the approximations of Section 6.4.3 are required. Of course,
large problems will utilize a network reliability computation program, but an
approximation can be used to check the program results or to speed up the
computation in a truly large problem [Colbourn, 1987, 1993; Shooman, 1990].

The complexity of the cut-set and tie-set methods depends on two factors:
the order of complexity involved in finding the tie sets (or cut sets) and the
order of complexity for the inclusion-exclusion expansion. The algorithms for
finding the tie sets are of polynomial complexity; one discussed in
Shier [1991, p. 63] is of complexity order O(n + e + ie). In the case of cut
sets, the finding algorithms are also of polynomial complexity, and Shier
[1991, p. 69] discusses one that is of order O([n + e]j). Observe that the
notation O(f) is called the order of f or “big O of f.” For example, if
$f = 5x^3 + 10x^2 + 12$, the order of f would be the dominating term in f as x
becomes large, which is $5x^3$. Since the constant 5 is a multiplier independent
of the size of x, it is ignored, so $O(5x^3 + 10x^2 + 12) = O(x^3)$ (see Rosen
[1999, p. 105]).

In both cases, the dominating complexity is that of expansion for the
inclusion-exclusion algorithm for Eqs. (6.12) and (6.13), where the orders of
complexity are exponential, $O(2^i)$ or $O(2^j)$ [Colbourn, 1987, 1993]. This is
the reason why approximate methods are discussed in the next two sections.
In addition, some of these algorithms are explored in the problems at the end
of this chapter.

If we examine Eqs. (6.12) and (6.13), we see that the complexity of 
these expressions is a function of the number of cut sets or tie sets, the number of
edges in the cut sets or tie sets, and the number of “brackets” that must be 
expanded (the number of terms in the union of cut sets or tie sets—i.e., in 
the inclusion-exclusion formula). We can approximate the cut-set or tie-set 
expression by dropping some of the less-significant brackets of the expansion, 
by dropping some of the less-significant cut sets or tie sets, or by both. 

6.4.3 Truncation Approximations 

The inclusion-exclusion expansions of Eqs. (6.12) and (6.13) sometimes yield
a sequence of probabilities that decrease in size so that many of the
higher-order terms in the sequence can be neglected, resulting in a simpler
approximate formula. These terms are products of probabilities, so if these
probabilities are small, the higher-order product terms can be neglected. In the
case of tie-set probabilities, this is when the probabilities of success are
small, the so-called low-reliability region, which is not the region of
practical interest. Cut-set analysis is preferred since its terms are small when
the probabilities of failure are small, the so-called high-reliability region,
which is really the region of practical interest. Thus cut-set approximations
are the ones most frequently used in practice. If only the first bracket in
Eq. (6.14c) is retained in addition to the unity term, one obtains the same
expression that would have ensued had the cuts been disjoint (but they are not).
Thus we will call the retention of only the first two terms the disjoint
approximation.

In Shooman [1990, Section 3.6.5], it is shown that a disjoint cut-set
approximation is a lower bound. For the example of Fig. 6.1, we obtain Eq.
(6.17) for the disjoint approximation, and assuming q = 0.1:

$$R_{ab} \ge 1 - [2q^3 + 2q^4] = 1 - 0.002 - 0.0002 = 0.9978 \tag{6.17}$$


which is quite close to the exact value given in Eq. (6.16). If we include the
next bracket in Eq. (6.14c), we get a closer approximation at the expense of
computing $j + \binom{j}{2} = j(j + 1)/2$ terms.

$$R_{ab} \le 1 - [2q^3 + 2q^4] + [5q^5 + q^6] = 0.9978 + 5 \times 0.1^5 + 0.1^6 = 0.997851 \tag{6.18}$$


Equation (6.18) is not only an approximation but an upper bound. In fact, as
more terms are included in the inclusion-exclusion formula, we obtain a set of
alternating bounds (see Shooman [1990, Section 3.6.5]). Note that Eq. (6.17)
is a sharp lower bound and that Eq. (6.18) is even sharper, but both equations
effectively bracket the exact result. Clearly, the sharpness of these bounds
increases as $q_i = 1 - p_i$ decreases for the i edges of the graph.

$$0.997800 \le R_{ab} \le 0.997851 \tag{6.19}$$

We can approximate $R_{ab}$ by the midpoint of the two bounds.

$$R_{ab} \approx \frac{0.997800 + 0.997851}{2} = 0.9978255 \tag{6.20}$$


The accuracy of the preceding approximation can be evaluated by examining
the deviation in the computed probability of failure $F_{ab} = 1 - R_{ab}$. In
the region of high reliability, all the values of $R_{ab}$ are very close to
unity, and differences are misleadingly small. Thus, as our error criterion, we
will use

$$\%\ \text{error} = \frac{|F_{ab}(\text{estimate}) - F_{ab}(\text{exact})|}{F_{ab}(\text{exact})} \times 100 \tag{6.21}$$

Of course, the numerator of Eq. (6.21) would be the same if we took the
differences in the reliabilities. Evaluation of Eq. (6.21) for the results given
in Eqs. (6.16) and (6.20) yields

$$\%\ \text{error} = \frac{|0.0021745 - 0.002152|}{0.002152} \times 100\% = 1.05 \tag{6.22}$$

Clearly, this approximation is good for this example and will be good in most
cases. Of course, in approximate evaluations of a large network, we do not
know the exact reliability, but we can still approximate Eq. (6.21) by using
the difference between the two-term and three-term approximations for the
numerator and their average for the denominator:

$$\%\ \text{error} = \frac{|0.002200 - 0.002149|}{0.0021745} \times 100\% = 2.35 \tag{6.23}$$


A moment's reflection leads to the conclusion that the highest-order
approximation will always be the closest and should be used in the denominator
of an error bound. The numerator, on the other hand, should be the difference
between the two highest-order approximations. Thus, for our example,

$$\%\ \text{error} = \frac{|0.002200 - 0.002149|}{0.002149} \times 100\% = 2.37 \tag{6.24}$$


Therefore, a practical approach in designing a computer analysis program is
to ask the analyst to input the accuracy he or she desires, then to compute a
succession of bounds involving more and more terms in the expansion of Eq.
(6.13) at each stage. An equation similar to Eq. (6.24) would be used for the
last two terms in the computation to determine when to stop computing
alternating bounds. The process terminates when the error approximation yields
an estimated error that is smaller than the required error.
We should take note that the complexity of the “one-at-a-time” approximation
is of order j (number of cut sets) and that of the “two-at-a-time”
approximation is of order $j^2$. Thus, even if the error approximation indicates
that more terms are needed, the complexity will only be of order $j^3$ or
perhaps $j^4$. The inclusion-exclusion complexity is therefore reduced from
order $2^j$ to a polynomial in j (perhaps $j^2$ or $j^3$).
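
The stopping rule just described can be sketched as follows. This is an illustrative reading of the procedure, assuming identical, independent edges with failure probability q; the function names and the tolerance argument are hypothetical. The function works with the failure probability $F_{ab} = 1 - R_{ab}$ and stops when the Eq. (6.24)-style error estimate falls below the requested percentage.

```python
from itertools import combinations

CUT_SETS = [{1, 4, 5}, {1, 6, 2}, {1, 5, 6, 3}, {1, 2, 3, 4}]

def bracket_sums(cut_sets, q):
    """Yield the magnitude of each successive inclusion-exclusion bracket."""
    for k in range(1, len(cut_sets) + 1):
        yield sum(q ** len(set().union(*group))
                  for group in combinations(cut_sets, k))

def truncated_unreliability(cut_sets, q, tolerance_pct):
    """Successive alternating bounds on F = 1 - R, stopping early.

    The error estimate is |difference of the last two bounds| divided by
    the latest bound, times 100, in the manner of Eq. (6.24).
    """
    history, estimate, sign, previous = [], 0.0, 1.0, None
    for bracket in bracket_sums(cut_sets, q):
        estimate += sign * bracket
        sign = -sign
        history.append(estimate)
        if previous is not None:
            error_pct = abs(previous - estimate) / estimate * 100.0
            if error_pct <= tolerance_pct:
                break
        previous = estimate
    return history

print(truncated_unreliability(CUT_SETS, 0.1, 5.0))   # stops after two bounds
print(truncated_unreliability(CUT_SETS, 0.1, 0.0))   # runs the full expansion
```

With a 5% tolerance the sketch stops at the two-bracket bound, F = 0.002149, whose estimated error is about 2.37%; a tolerance of zero forces the complete expansion and returns the exact F = 0.002152 as the last entry.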


6.4.4 Subset Approximations 

In the last section, we discussed approximation by truncating the
inclusion-exclusion expression. Now we discuss approximation by exclusion of
low-probability cut sets or tie sets. Clearly, the occurrence probability of the
lower-order (fewer edges) cut sets is higher than that of the higher-order (more
edges) ones. Thus, we can approximate Eq. (6.14a) by dropping the fourth-order
cut sets $C_3$ and $C_4$ and retaining the third-order cut sets $C_1$ and $C_2$
to yield an upper bound (since we have dropped cut sets, we are making an
optimistic approximation).

$$R_{ab} \le 1 - P(C_1 + C_2) = 1 - P(C_1) - P(C_2) + P(C_1 C_2) = 1 - P(1'4'5') - P(1'6'2') + P(1'2'4'5'6') \tag{6.25a}$$

For q = 0.1,

$$R_{ab} \le 1 - 2 \times 0.1^3 + 0.1^5 = 0.99801 \tag{6.25b}$$

We can use similar reasoning to develop a lower bound (we drop tie sets,
thereby making a pessimistic approximation) by dropping all but the “one-hop”
tie sets ($T_1$) and the “two-hop” tie sets ($T_2$, $T_3$); compare Eq. (6.12)
and Table 6.2.

$$R_{ab} \ge P(T_1 + T_2 + T_3) = P(1 + 25 + 46) = P(1) + P(25) + P(46) - [P(125) + P(146) + P(2456)] + [P(12456)] \tag{6.26a}$$

For p = 0.9,

$$R_{ab} \ge p + 2p^2 - 2p^3 - p^4 + p^5 = 0.9 + 2 \times 0.9^2 - 2 \times 0.9^3 - 0.9^4 + 0.9^5 = 0.99639 \tag{6.26b}$$


Given Eqs. (6.25b) and (6.26b), we can bound $R_{ab}$ by

$$0.99639 \le R_{ab} \le 0.99801 \tag{6.27}$$

and approximate $R_{ab}$ by the midpoint of the two bounds:

$$R_{ab} \approx \frac{0.99639 + 0.99801}{2} = 0.9972 \tag{6.28}$$

The error bound for this approximation is computed in the same manner as
Eq. (6.23).

$$\%\ \text{error} = \frac{|0.00361 - 0.00199|}{0.0028} \times 100\% = 57.9 \tag{6.29}$$


The percentage error is larger than in the case of the truncation approximations, but it remains small enough for the approximation to be valid. The complexity is still exponential, of order $2^x$; however, $x$ is now a small integer and $2^x$ is of modest size. Furthermore, the tie-set and cut-set algorithms take less time since we no longer need to find all cut sets and tie sets, only those of order < $x$. Of course, one can always combine both approximation methods by dropping out higher-order cut sets and then also truncating the expansion. For more details on network reliability approximations, see Murray [1992, 1993].
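The subset bounds of Eqs. (6.25) and (6.26) can be checked numerically. The following Python sketch is only an illustration: the helper name `union_probability` and the representation of each cut set or tie set as a set of edge numbers are assumptions made here, and all edges are assumed independent with the same reliability $p$.

```python
from itertools import combinations

def union_probability(sets, prob):
    """Exact P(A1 + A2 + ...) by inclusion-exclusion.

    Each Ai is the event that every edge in sets[i] is in the same state
    (all up for a tie set, all down for a cut set); `prob` is the
    probability that one edge is in that state. Edges are independent.
    """
    total = 0.0
    for k in range(1, len(sets) + 1):
        sign = 1 if k % 2 else -1
        for group in combinations(sets, k):
            edges = set().union(*group)      # edges involved in the intersection
            total += sign * prob ** len(edges)
    return total

p, q = 0.9, 0.1
cut_sets = [{1, 4, 5}, {1, 2, 6}]            # retained third-order cut sets C1, C2
tie_sets = [{1}, {2, 5}, {4, 6}]             # retained tie sets T1, T2, T3

upper = 1 - union_probability(cut_sets, q)   # optimistic bound, Eq. (6.25)
lower = union_probability(tie_sets, p)       # pessimistic bound, Eq. (6.26)
midpoint = (lower + upper) / 2
print(round(lower, 5), round(upper, 5), round(midpoint, 5))
```

Because only the retained sets enter the expansion, the number of terms grows as $2^x$ in the number of retained sets rather than in the total number of cut sets or tie sets.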


6.4.5 Graph Transformations 

Anyone who has studied electric-circuit theory in a physics or engineering class knows that complex networks of resistors can be reduced to an equivalent single resistance through various combinations of series, parallel, and T-Δ transformations. Such knowledge has obviously stimulated the development of equivalent network transformations: some that are remarkably similar, and some that are quite different, especially in the case of all-terminal reliability (to be discussed later). We must remember that these are not flow transformations but probability transformations.

[Figure 6.2 Illustration of series, parallel, and decomposition transformations for two-terminal pair networks. (a) Series: edges 1 and 2 joining nodes a, b, and c are replaced by a single edge ac with $P_{ac} = P(1 \cdot 2)$. Decomposition: expand about edge 5, $R_{ad} = P(5)P(G_1) + P(5')P(G_2)$.]

This method of calculating network reliability is based on transforming the network into a simpler network (or set of networks) by successively applying transformations. Such transformations are simpler for two-terminal reliability than for all-terminal reliability. For example, for two-terminal reliability, we can use the transformations given in Fig. 6.2. In this figure, the series transformation indicates that we replace two branches in series with a single branch that is denoted by the intersection of the two original branches (1 · 2). In the parallel transformation, we replace the two parallel branches with a single branch that is denoted by the union of the two parallel branches (1 + 2). The edge-factoring case is more complex; the obvious branch to factor about is edge 5, which complicates the graph. If edge 5 is good, it is assigned a probability of 1 (shorted), and the graph decomposes to $G_1$. If edge 5 is bad, however, no transmission can occur through it, so it is assigned a probability of 0 (open circuit), and the graph decomposes to $G_2$. Note that both $G_1$ and $G_2$ can now be evaluated by using combinations of series and parallel transformations. These three transformations (series, parallel, and decomposition) are all that is needed to perform the reliability analysis for many networks.


Now we discuss a more difficult network configuration. In the series transformation of Fig. 6.2(a), we readily observe that both edges 1 and 2 must be up (intersection) for a connection between a and c to occur. However, this transformation only works if there is no third edge connected to node b; if a third edge exists, a more elaborate transformation is needed (which will be discussed in Section 6.6 on all-terminal reliability). Similarly, in the case of the parallel transformation, nodes a and b are connected if either edge 1 or edge 2 is up (union).

Assume that any failures of edge 1 and edge 2 are independent and the probabilities of success for edges 1 and 2 are $p_1$ and $p_2$ (probabilities of failure are $q_1 = 1 - p_1$, $q_2 = 1 - p_2$). Then for the series subnetwork of Fig. 6.2(a), $p_{ac} = p_1 p_2$, and for the parallel subnetwork in Fig. 6.2(b), $p_{ab} = p_1 + p_2 - p_1 p_2 = 1 - q_1 q_2$.

The case of decomposition (called the keystone component method in system reliability [Shooman, 1990] or the edge-factoring method in network reliability) is a little more subtle; it is used to eliminate an edge x from a graph. Since all edges must either be up or down, we reduce the original network to two other networks $G_1$ (given that edge x is up) and $G_2$ (given that edge x is down). In general, one uses series and parallel transformations first, resorting to edge-factoring only when no more series or parallel transformations can be made. In the subnetwork of Fig. 6.2(c), we see that neither series nor parallel transformation is immediately possible because of edge 5, for which reason decomposition should be used.

The mathematical basis of the decomposition transformation lies in the laws of conditional probability and Bayesian probability [Mendenhall, 1990, pp. 64-65]. These laws lead to the following probability equation for terminal pair st and edge x:

$$P(\text{there is a path between } s \text{ and } t)$$

$$= P(x \text{ is good}) \times P(\text{there is a path between } s \text{ and } t \mid x \text{ is good})$$

$$+ P(x \text{ is bad}) \times P(\text{there is a path between } s \text{ and } t \mid x \text{ is bad}) \tag{6.30}$$

The preceding equation can be rewritten in a more compact notation as follows:

$$P_{st} = P(x)P(G_1) + P(x')P(G_2) \tag{6.31}$$

The term $P(G_1)$ is the probability of a connection between s and t for the modified network where x is good, that is, the terminals at either end of edge x are connected to the graph [see Fig. 6.2(c)]. Similarly, the term $P(G_2)$ is the probability that there is a connection between s and t for the modified network $G_2$ where x is bad, that is, the edge x is removed from the graph [again, see Fig. 6.2(c)]. Thus Eq. (6.31) becomes for st = ad:

$$P_{st} = p_5(1 - q_1 q_3)(1 - q_2 q_4) + q_5(p_1 p_2 + p_3 p_4 - p_1 p_2 p_3 p_4) \tag{6.32}$$
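Equation (6.32) can be checked against a brute-force enumeration of the 32 edge states. The Python sketch below is illustrative only: the function names are hypothetical, and the bridge topology (edges 1 and 2 in series on one side, edges 3 and 4 in series on the other, edge 5 joining the two middle nodes) is inferred from the terms of Eq. (6.32) rather than read from the figure.

```python
from itertools import product

def series(*ps):
    """All branches must be up."""
    out = 1.0
    for p in ps:
        out *= p
    return out

def parallel(*ps):
    """At least one branch must be up."""
    fail = 1.0
    for p in ps:
        fail *= 1 - p
    return 1 - fail

def bridge_by_factoring(p1, p2, p3, p4, p5):
    g1 = series(parallel(p1, p3), parallel(p2, p4))   # edge 5 good (shorted)
    g2 = parallel(series(p1, p2), series(p3, p4))     # edge 5 bad (open)
    return p5 * g1 + (1 - p5) * g2                    # Eq. (6.31)

def bridge_by_enumeration(ps):
    total = 0.0
    for up in product([True, False], repeat=5):
        e1, e2, e3, e4, e5 = up
        # Tie sets of the assumed bridge: 12, 34, 154, 352.
        connected = ((e1 and e2) or (e3 and e4)
                     or (e1 and e5 and e4) or (e3 and e5 and e2))
        if connected:
            weight = 1.0
            for p, state in zip(ps, up):
                weight *= p if state else 1 - p
            total += weight
    return total

ps = (0.9, 0.8, 0.95, 0.85, 0.7)   # arbitrary unequal edge reliabilities
print(bridge_by_factoring(*ps), bridge_by_enumeration(ps))
```

The two results agree for any choice of edge probabilities, which is what the conditional-probability argument of Eq. (6.30) guarantees.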
[Figure 6.3 Decomposition subnetworks for the graph of Fig. 6.1 expanded about edge 6. $G_1$: edge 6 good (short); $G_2$: edge 6 bad (open).]


Of course, in most examples, networks $G_1$ and $G_2$ are a bit more complex, and sometimes transformations are recursively computed. More examples of transformations appear in the problems at the end of this chapter; for a complete discussion of transformations, see Satyanarayana [1985] and A. M. Shooman [1992].

We can illustrate the use of the three transformations of Fig. 6.2 on the network given in Fig. 6.1, where we begin by decomposing about edge 6.

$$R_{ab} = P(6) \cdot P(G_1) + P(6') \cdot P(G_2) \tag{6.33}$$

The networks $G_1$ and $G_2$ are shown in Fig. 6.3. Note that for edge 6 good (up), nodes b and d merge in $G_1$, whereas for edge 6 bad (down), edge 6 is simply removed from the network.

We now calculate $P(G_1)$ and $P(G_2)$ for a connection between nodes a and b with the aid of the series and parallel transformations of Fig. 6.2:

$$P(G_1) = P(1 + 4 + 52 + 53) = [P(1) + P(4) + P(52) + P(53)]$$

$$- [P(14) + P(152) + P(153) + P(452) + P(453) + P(523)]$$

$$+ [P(4523) + P(1523) + P(1453) + P(1452)] - [P(14523)] \tag{6.34}$$

$$P(G_2) = P(1 + 25 + 243) = [P(1) + P(25) + P(243)]$$

$$- [P(125) + P(1243) + P(2543)] + [P(12543)] \tag{6.35}$$

Assuming that all edges are identical and independent with probabilities of success and failure of p and q, substitution into Eqs. (6.33), (6.34), and (6.35) yields

$$R_{ab} = p[2p + p^2 - 5p^3 + 4p^4 - p^5] + q[p + p^2 - 2p^4 + p^5] \tag{6.36}$$

Substitution of $p = 0.9$ and $q = 0.1$ into Eq. (6.36) yields

$$R_{ab} = 0.9[0.99891] + 0.1[0.98829] = 0.997848 \tag{6.37}$$

Of course, this result agrees with the previous computation given in Eq. (6.16).
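The decomposition of Eq. (6.31) can also be applied recursively to every edge, which gives a compact (though exponential-time) program. The Python sketch below is a minimal illustration under stated assumptions: the function name `two_terminal` is hypothetical, no series or parallel reductions are attempted, and the edge numbering for Fig. 6.1 (1: ab, 2: bc, 3: cd, 4: ad, 5: ac, 6: bd) is inferred from the tie sets used above, since the figure itself is not reproduced here.

```python
def two_terminal(edges, s, t):
    """Two-terminal reliability by recursive edge factoring, Eq. (6.31).

    edges is a list of (u, v, p) tuples with independent success
    probability p. Factoring on the first edge either merges its end
    nodes (edge good) or deletes the edge (edge bad).
    """
    if s == t:
        return 1.0            # terminals have been merged: connected
    if not edges:
        return 0.0            # no edges left and terminals still distinct
    (u, v, p), rest = edges[0], edges[1:]

    def merge(x):
        return u if x == v else x

    # Edge good: merge node v into node u and drop any self-loops created.
    contracted = [(merge(a), merge(b), r) for a, b, r in rest
                  if merge(a) != merge(b)]
    good = two_terminal(contracted, merge(s), merge(t))
    bad = two_terminal(rest, s, t)       # edge bad: simply removed
    return p * good + (1 - p) * bad

p = 0.9
fig_6_1 = [("a", "b", p), ("b", "c", p), ("c", "d", p),
           ("a", "d", p), ("a", "c", p), ("b", "d", p)]
r_ab = two_terminal(fig_6_1, "a", "b")
print(round(r_ab, 6))
```

A practical program would apply series and parallel reductions before each factoring step, as recommended above, so that far fewer subnetworks are generated.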


6.5 NODE PAIR RESILIENCE 

All-terminal reliability, in which all the node pairs can communicate, is discussed in the next section. Also, k-terminal reliability will be treated as a specified subset (2 < k < all-terminal pairs) of all-terminal reliability. In this section, another metric, essentially one between two-terminal and all-terminal, is discussed.

Van Slyke and Frank [1972] proposed a measure they called resilience for the expected number of node pairs that can communicate (i.e., they are connected by one or more tie sets). Let s and t represent a node pair. The number of node pairs in a network with N nodes is the number of combinations of N choose 2, that is, the number of combinations of 2 out of N.


$$\text{Number of node pairs} = \binom{N}{2} = \frac{N!}{2!(N-2)!} = \frac{N(N-1)}{2} \tag{6.38}$$

Our notation for the set of s, t node pairs contained in an N node network is $\{s, t\} \subset N$, and the expected number of node pairs that can communicate is denoted as resilience, res(G):

$$\text{res}(G) = \sum_{\{s,t\} \subset N} R_{st} \tag{6.39}$$

We can illustrate a resilience calculation by applying Eq. (6.39) to the network of Fig. 6.1. We begin by observing that if $p = 0.9$ for each edge, symmetry simplifies the computation. The node pairs divide into two categories: the edge pairs (ab, ad, bc, and cd) and the diagonal pairs (ac and bd). The edge-pair reliabilities were already computed in Eqs. (6.36) and (6.37). For the diagonals, we can use the decomposition given in Fig. 6.3 (where s = a and t = c) and compute $R_{ac}$ as shown in the following equations:

$$P(G_1) = P[5 + (1 + 4)(2 + 3)]$$

$$= P(5) + P(1 + 4)P(2 + 3) - P(5)P(1 + 4)P(2 + 3)$$

$$= [p + (1 - p)(2p - p^2)^2] \tag{6.40a}$$

$$P(6)P(G_1) = p[p + (1 - p)(2p - p^2)^2] = 0.898209 \tag{6.40b}$$

$$P(G_2) = P[5 + 12 + 43] = P(5) + P(12) + P(43)$$

$$- [P(512) + P(543) + P(1243)]$$

$$+ [P(51243)] = p + 2p^2 - 2p^3 - p^4 + p^5 \tag{6.41a}$$

$$P(6')P(G_2) = q[p + 2p^2 - 2p^3 - p^4 + p^5] = 0.099639 \tag{6.41b}$$

$$R_{ac} = 0.898209 + 0.099639 = 0.997848 \tag{6.42}$$

Substitution of Eqs. (6.37) and (6.42) into (6.39) yields

$$\text{res}(G) = 4 \times 0.997848 + 2 \times 0.997848 = 5.987 \tag{6.43}$$


Note for this particular example that because all edge reliabilities are equal and because it is a complete graph, symmetry would have predicted that $R_{st}$ values were the same for node pairs ab, ad, bc, and cd, and similarly for node pairs ac and bd. Clearly, for a very reliable network, the resilience will be close to the maximum $N(N - 1)/2$, which for this example is 6. In fact, it may be useful to normalize the resilience by dividing it by $N(N - 1)/2$ to yield a “normalized” resilience metric. In our example, res(G)/6 = 0.997848. In general, if we divide Eq. (6.39) by Eq. (6.38), we obtain the average reliability for all the two-terminal pairs in the network. Although this requires considerable computation, the metric may be useful when the $p_i$ are unequal.
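Since resilience is an expected value, it can also be computed directly as the probability-weighted count of connected node pairs over all edge states. The Python sketch below takes that route for the four-node complete graph; the function names are hypothetical, all edges are assumed independent with the same reliability, and the enumeration is exponential in the number of edges, so it is only suitable for small examples.

```python
from itertools import combinations, product

def component_labels(nodes, up_edges):
    """Give every node the label of its connected component."""
    label = {n: n for n in nodes}
    for u, v in up_edges:
        lu, lv = label[u], label[v]
        if lu != lv:
            for n in nodes:               # merge component lv into lu
                if label[n] == lv:
                    label[n] = lu
    return label

def resilience(nodes, edges, p):
    """Expected number of node pairs that can communicate, Eq. (6.39)."""
    total = 0.0
    for state in product([True, False], repeat=len(edges)):
        weight = 1.0
        for ok in state:
            weight *= p if ok else 1 - p
        up = [e for e, ok in zip(edges, state) if ok]
        label = component_labels(nodes, up)
        pairs = sum(label[s] == label[t] for s, t in combinations(nodes, 2))
        total += weight * pairs
    return total

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"),
         ("a", "d"), ("a", "c"), ("b", "d")]
res = resilience(nodes, edges, 0.9)
normalized = res / (len(nodes) * (len(nodes) - 1) / 2)
print(round(res, 6), round(normalized, 6))
```

Replacing the single `p` with a per-edge probability would give the unequal-reliability case mentioned above, where symmetry no longer shortens the hand calculation.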


6.6 ALL-TERMINAL RELIABILITY 

The all-terminal reliability problem is somewhat more difficult than the two- 
terminal reliability problem. Essentially, we must modify the two-terminal 
problem to account for all-terminal pairs. Each of the methods of Section 6.4 
is discussed in this section for the case of all-terminal reliability. 


6.6.1 Event-Space Enumeration 

We may proceed as we did in Section 6.4.1 except that now we examine all the good events for two-terminal reliability and strike out (i.e., classify as bad) those that do not connect all the terminal pairs. By applying these restrictions to Table 6.1, we obtain Table 6.3. From this table, we can formulate an all-terminal reliability expression similar to the two-terminal case.

TABLE 6.3 Modification of Table 6.1 for the All-Terminal Reliability Problem

Event                                  Connection ab   Connection ac   Connection ad   Term
E1                                     √               √               √               $p^6$
E2, E3, ..., E7                        √               √               √               $qp^5$
E8, E9, ..., E22                       √               √               √               $q^2p^4$
E23, E24, E26, E27, E28, E29, E30,     √               √               √               $q^3p^3$
  E32, E33, E34, E37, E38, E39, E40,
  E41, E42
Other 26 fail for at least 1                                                           $q^3p^3$, $q^4p^2$,
  terminal pair                                                                        $q^5p$, $q^6$

$$R_{\text{all-terminal}} = R_{\text{all}} = \sum_{\substack{i = 1 \\ i \ne 25, 31, 35, 36}}^{42} P(E_i) \tag{6.44}$$

Note that events 25, 31, 35, and 36 represent the only failure events with three 
edge failures. These four cases involve isolation of each of the four vertices. 
All other failure events involve four or more failures. 

Substituting the terms from Table 6.3 into Eq. (6.44) yields

$$R_{\text{all}} = p^6 + 6qp^5 + 15q^2p^4 + 16q^3p^3 = 0.9^6 + 6 \times 0.1 \times 0.9^5 + 15 \times 0.1^2 \times 0.9^4 + 16 \times 0.1^3 \times 0.9^3$$

$$= 0.531441 + 0.354294 + 0.098415 + 0.011664 \tag{6.45a}$$

$$R_{\text{all}} = 0.995814 \tag{6.45b}$$

Of course, the all-terminal reliability is lower than the two-terminal reliability 
computed previously. 
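The event-space count behind Eq. (6.45a) is easy to automate for a graph this small. The Python sketch below enumerates all 64 edge states of the four-node complete graph and tallies the states in which every node is connected; the function name `all_connected` and the edge list are illustrative assumptions, and the events are not numbered in the order of Table 6.1.

```python
from itertools import product

def all_connected(nodes, up_edges):
    """Search outward from the first node over the working edges."""
    reached, frontier = {nodes[0]}, [nodes[0]]
    while frontier:
        n = frontier.pop()
        for u, v in up_edges:
            other = v if u == n else u if v == n else None
            if other is not None and other not in reached:
                reached.add(other)
                frontier.append(other)
    return len(reached) == len(nodes)

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"),
         ("a", "d"), ("a", "c"), ("b", "d")]
p, q = 0.9, 0.1

good_by_failures = {}          # number of good events for each failure count
r_all = 0.0
for state in product([True, False], repeat=len(edges)):
    up = [e for e, ok in zip(edges, state) if ok]
    if all_connected(nodes, up):
        failures = state.count(False)
        good_by_failures[failures] = good_by_failures.get(failures, 0) + 1
        r_all += p ** (len(edges) - failures) * q ** failures

print(good_by_failures, round(r_all, 6))
```

The tally of good events per failure count supplies the coefficients of the polynomial in Eq. (6.45a), and the good events with three failures are exactly the spanning trees of the graph.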

6.6.2 Cut-Set and Tie-Set Methods 

One can also compute all-terminal reliability using cut- and tie-set methods either via exact computations or via the various approximations. The computations become laborious even for the modest-size problem of Fig. 6.1. Thus we will set up the exact calculations and discuss the solution rather than carry out the computations. Exact calculations for a practical network would be performed via a network modeling program; therefore, the purpose of this section is to establish the understanding of how computations are performed and also to serve as a background for the approximate methods that follow.

TABLE 6.4 Cut Sets and Tie Sets for All-Terminal Reliability Computations

              Pair ab              Pair ad              Pair ac
Tie Sets      $T_1 = 1$            $T_6 = 5$            $T_{11} = 4$
              $T_2 = 25$           $T_7 = 12$           $T_{12} = 53$
              $T_3 = 46$           $T_8 = 43$           $T_{13} = 16$
              $T_4 = 234$          $T_9 = 163$          $T_{14} = 123$
              $T_5 = 356$          $T_{10} = 462$       $T_{15} = 526$
Cut Sets      $C_1 = 1'5'4'$       $C_5 = 5'4'1'$       $C_9 = 4'6'3'$
              $C_2 = 1'2'6'$       $C_6 = 5'3'2'$       $C_{10} = 4'5'1'$
              $C_3 = 1'3'5'6'$     $C_7 = 5'6'4'2'$     $C_{11} = 4'1'2'3'$
              $C_4 = 1'2'3'4'$     $C_8 = 5'6'1'3'$     $C_{12} = 4'2'5'6'$

We begin by finding the tie sets and cut sets for the terminal pairs ab, ac, and ad (see Fig. 6.1 and Table 6.4). Note that if node a is connected to all other nodes, there is a connection between all nodes.

In terms of tie sets, we can write

$$P_{\text{all}} = P([\text{path } ab] \cdot [\text{path } ad] \cdot [\text{path } ac]) \tag{6.46}$$

$$P_{\text{all}} = P([T_1 + T_2 + \cdots + T_5] \cdot [T_6 + T_7 + \cdots + T_{10}] \cdot [T_{11} + T_{12} + \cdots + T_{15}]) \tag{6.47}$$

The expansion of Eq. (6.47) involves 125 intersections followed by complex calculations involving expansion of the union of the resulting events (inclusion-exclusion); clearly, hand computations are starting to become intractable. A similar set of equations can be written in terms of cut sets. In this case, interrupting path ab, ad, or ac is sufficient to generate all-terminal cut sets.

$$P_{\text{all}} = 1 - P([\text{no path } ab] + [\text{no path } ad] + [\text{no path } ac]) \tag{6.48}$$

$$P_{\text{all}} = 1 - P([C_1 + C_2 + C_3 + C_4] + [C_5 + C_6 + C_7 + C_8] + [C_9 + C_{10} + C_{11} + C_{12}]) \tag{6.49}$$

The expansion of Eq. (6.49) involves the expansion of the union for 12 events (there are $2^{12}$ terms; see Section A4.2), so the disjoint approximation, the reduced approximation, or a computer solution is the only practical approach.

6.6.3 Cut-Set and Tie-Set Approximations

The difficulty in expanding Eqs. (6.47) and (6.49) makes approximations almost imperative in any pencil-paper-and-calculator analysis. We can begin by simplifying Eq. (6.49) by observing that cut sets $C_1 = C_5 = C_{10}$, $C_4 = C_{11}$, $C_7 = C_{12}$, and $C_3 = C_8$; then $C_5$, $C_{10}$, $C_{11}$, $C_{12}$, and $C_8$ can be dropped, thereby reducing Eq. (6.49) to seven cut sets. Since all edges are assumed to have equal reliabilities, $p = 1 - q$, the disjoint approximation for Eq. (6.49) yields

$$P_{\text{all}} > 1 - P(C_1 + C_2 + C_3 + C_4 + C_6 + C_7 + C_9) \tag{6.50a}$$

and substituting $q = 0.1$ yields

$$P_{\text{all}} > 1 - 4q^3 - 3q^4 = 1 - 4(0.1)^3 - 3(0.1)^4 = 0.9957 \tag{6.50b}$$

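To obtain an upper bound, we add the 21 terms in the second bracket in the expansion of Eq. (6.49). Each pair of the seven retained cut sets covers five edges, except for the three pairs formed from the fourth-order cut sets $C_3$, $C_4$, and $C_7$, which cover all six edges, to yield

$$P_{\text{all}} < 0.9957 + 18q^5 + 3q^6 = 0.995883 \tag{6.51}$$

If we average Eqs. (6.50b) and (6.51), we obtain $P_{\text{all}} \approx 0.995792$, which is (0.000183 × 100/0.004208) = 4.35% in error [cf. Eq. (6.24)]. In this case, the approximation yields excellent results.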
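The first two brackets of the cut-set expansion can be tabulated mechanically from the seven distinct cut sets listed in Eq. (6.50a), which gives a way to check the $q^5$ and $q^6$ coefficients of the 21 pairwise terms. The Python sketch below is illustrative only; the variable names are assumptions, the cut sets are copied from Table 6.4, and all edges are taken to fail independently with probability $q$.

```python
from itertools import combinations

q = 0.1
cut_sets = {
    "C1": {1, 4, 5}, "C2": {1, 2, 6}, "C6": {2, 3, 5}, "C9": {3, 4, 6},
    "C3": {1, 3, 5, 6}, "C4": {1, 2, 3, 4}, "C7": {2, 4, 5, 6},
}

# First bracket: sum of the individual cut-set probabilities.
first = sum(q ** len(c) for c in cut_sets.values())

# Second bracket: a pair of cut sets occurs together when every edge in
# their union has failed, so each term is q raised to the union size.
pair_sizes = [len(a | b) for a, b in combinations(cut_sets.values(), 2)]
second = sum(q ** size for size in pair_sizes)

lower = 1 - first               # disjoint approximation
upper = 1 - first + second      # add back the pairwise terms
print(len(pair_sizes), pair_sizes.count(5), pair_sizes.count(6))
print(lower, upper, (lower + upper) / 2)
```

Stopping the expansion after an odd or an even bracket alternately gives lower and upper bounds on the reliability, which is why the two values printed on the last line bracket the exact result of Eq. (6.45b).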


6.6.4 Graph Transformations 

In the case of all-terminal reliability, the transformation schemes must be defined in a more careful manner than was done for the two-terminal case in Fig. 6.2. The problem arises in the case in which a series transformation is to be performed. As noted in part (a) of Fig. 6.2, the series transformation eliminates node b, causing no trouble in the two-terminal reliability computation where node b is not an initial or terminal vertex (for $R_{st}$ where neither s nor t is b). This is the crux of the matter, since we must still include node b in the all-terminal computation. Of course, eliminating node b does not invalidate the transmission between nodes a and c. If we continue to use Eq. (6.46) to define all-terminal reliability, the transformations given in Fig. 6.2 are correct; however, we must evaluate all the events in the brackets of Eq. (6.46) and their intersections. Essentially, this reduces the transformation procedure to an equivalent tree with one node with incidence $N - 1$ (called a root or central node) and the remainder of incidence 1 (called pendant nodes). The tree is then evaluated.

The more common procedure for all-terminal transformation is to reduce the network to two nodes, s and t, where the reliability of the equivalent s-t edge is the network all-terminal reliability [A. M. Shooman, 1992]. A simple example (Fig. 6.4) clarifies the differences in these two approaches.

[Figure 6.4 An example illustrating all-terminal reliability transformations: a three-node network (nodes a, b, and c; edges 1, 2, and 3) with $P(1) = P(2) = P(3) = p = 0.9$.]


We begin our discussion of Fig. 6.4 by using event-space methods to calculate the all-terminal reliability. The events are $E_1 = 123$, $E_2 = 1'23$, $E_3 = 12'3$, $E_4 = 123'$, $E_5 = 1'2'3$, $E_6 = 1'23'$, $E_7 = 12'3'$, and $E_8 = 1'2'3'$. By inspection, we see that the good events (which connect a to b and a to c) are $E_1$, $E_2$, $E_3$, and $E_4$. Note that any event with two or more edge failures isolates the vertex connected to these two edges and is a cut set.

$$R_{\text{all}} = P(E_1 + E_2 + E_3 + E_4) = P(E_1) + P(E_2) + P(E_3) + P(E_4)$$

$$= P(123) + P(1'23) + P(12'3) + P(123')$$

$$= p^3 + 3qp^2 = 0.9^3 + 3 \times 0.1 \times 0.9^2 = 0.972 \tag{6.52}$$

To perform all-terminal reliability transformations in the conventional manner, we choose two nodes, s and t, and reduce the network to an equivalent st edge. We can reduce any network using a combination of the three transformations shown in Fig. 6.5. Note that the series transformation has a denominator term $[1 - p(1')p(2')]$, which is the probability that the node that disappears (node b) is still connected. The other transformations are the same as the two-terminal case. Also, once the transformation process is over, the resulting probability $p_{st}$ must be multiplied by the connection probability of all nodes that have disappeared via the series transformation (for a proof of these procedures, see A. M. Shooman [1992]). We will illustrate the process by solving the network given in Fig. 6.4.

We begin to transform Fig. 6.4 by choosing the st nodes to be c and b; thus we wish to use the series transformation to eliminate node a. The transformation yields an edge that is in parallel with edge 1 (the direct cb edge) and has a probability of

$$\frac{P(3)P(2)}{1 - P(3')P(2')} = \frac{0.9 \times 0.9}{1 - 0.1 \times 0.1} = 0.8181818 \tag{6.53}$$


Combining the two branches (cab and cb) in parallel yields

$$P_{st} = P(cb + cab) = p(cb) + p(cab) - p(cb)p(cab)$$

$$= 0.9 + 0.8181818 - 0.9 \times 0.8181818 = 0.9818182 \tag{6.54}$$

The $P_{st}$ value must be multiplied by the probability that node a is connected, $P(2 + 3) = P(2) + P(3) - P(23) = 0.9 + 0.9 - 0.9 \times 0.9 = 0.99$.

$$P_{\text{all}} = 0.99 \times 0.9818182 = 0.972000 \tag{6.55}$$

[Figure 6.5 All-terminal graph-reduction techniques. (a) Series: edges 1 and 2 joining nodes a, b, and c are replaced by a single edge ac with $P_{ac} = p_1 p_2/(1 - q_1 q_2)$. Decomposition: expand about edge 5, $R_{ad} = P(5)P(G_1) + P(5')P(G_2)$; given edge 5 good, $P(G_1) = P(1 + 3) \times P(2 + 4)$.]
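The reduction of the three-node network of Fig. 6.4 can be written out in a few lines. The Python sketch below is only an illustration of the steps in Eqs. (6.53) to (6.55); the function names are hypothetical, and the series rule returns both the equivalent edge probability and the connection factor for the node that disappears so that the factor can be applied at the end.

```python
def series_all_terminal(p1, p2):
    """All-terminal series reduction (Fig. 6.5a).

    Returns the probability of the equivalent edge and the probability
    that the eliminated middle node is still connected.
    """
    connected = 1 - (1 - p1) * (1 - p2)
    return p1 * p2 / connected, connected

def parallel(p1, p2):
    """Parallel reduction, the same as in the two-terminal case."""
    return 1 - (1 - p1) * (1 - p2)

p = 0.9
p_cab, factor_a = series_all_terminal(p, p)   # eliminate node a, Eq. (6.53)
p_st = parallel(p, p_cab)                     # combine with edge cb, Eq. (6.54)
r_all = factor_a * p_st                       # restore node a, Eq. (6.55)

exact = p ** 3 + 3 * (1 - p) * p ** 2         # event-space result, Eq. (6.52)
print(r_all, exact)
```

Carrying full floating-point precision through the intermediate steps makes the transformation result coincide with the event-space value, which is the equality asserted by the proof cited above.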
[Figure 6.6 A second all-terminal reliability example.]

Of course, Eq. (6.55) yields the same result as Eq. (6.52). As a second example, suppose that you have the same three nodes as acb of Fig. 6.4 followed by a second identical triangle labeled c'a'b', and that nodes b and c' are the same node (see Fig. 6.6). The reliability is given by the square of the value in Eq. (6.52), that is,

$$R_{\text{all}} = 0.972^2 = 0.944784 \tag{6.56}$$

If we use transformations, we have $P_{st} = P_{cb'} = 0.9818182^2 = 0.9639669$. This must be multiplied by 0.99 to correct for node a that has disappeared and 0.99 for missing node a', yielding $0.99^2 \times 0.9639669 = 0.944784$. A comprehensive reliability computation program for two-terminal and all-terminal reliability using the three transformations in Figs. 6.2 or 6.5, as well as more advanced transformations, is given in A. M. Shooman [1992].

6.6.5 k-Terminal Reliability

Up until now we have discussed two-terminal and all-terminal reliability. One can define a more general concept of k-terminal reliability, where k terminals must be connected. When k = 2, we have two-terminal reliability; when k = n, we have all-terminal reliability. Thus k-terminal reliability can be viewed as a more general concept. See A. M. Shooman [1991] for a detailed development of this concept.
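For a small graph, the k-terminal definition can be evaluated directly by enumerating edge states and asking whether a chosen set of terminals is mutually connected. The Python sketch below does this for the four-node complete graph used earlier in the chapter; the function name `k_terminal`, the edge list, and the choice of terminal sets are assumptions made for illustration, and the method is not the transformation-based approach of the cited reference.

```python
from itertools import product

def k_terminal(edges, terminals, p):
    """Probability that all nodes in `terminals` can communicate.

    edges is a list of (u, v) pairs, each up independently with
    probability p. Every edge state is enumerated, so the cost grows
    as 2 to the number of edges.
    """
    total = 0.0
    root = terminals[0]
    for state in product([True, False], repeat=len(edges)):
        up = [e for e, ok in zip(edges, state) if ok]
        reached, frontier = {root}, [root]
        while frontier:
            n = frontier.pop()
            for u, v in up:
                other = v if u == n else u if v == n else None
                if other is not None and other not in reached:
                    reached.add(other)
                    frontier.append(other)
        if all(t in reached for t in terminals):
            failures = state.count(False)
            total += p ** (len(edges) - failures) * (1 - p) ** failures
    return total

edges = [("a", "b"), ("b", "c"), ("c", "d"),
         ("a", "d"), ("a", "c"), ("b", "d")]
r2 = k_terminal(edges, ["a", "b"], 0.9)            # k = 2
r3 = k_terminal(edges, ["a", "b", "c"], 0.9)       # k = 3
r4 = k_terminal(edges, ["a", "b", "c", "d"], 0.9)  # k = n
print(r2, r3, r4)
```

Requiring more terminals to be connected can only remove good events, so the three values decrease from the two-terminal case to the all-terminal case.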

6.6.6 Computer Solutions 

The foregoing sections introduced the concepts of network reliability computations. Clearly, the example of Fig. 6.1 is much simpler than most practical networks of interest, and, in general, a computer solution is required. Current research focuses on efficient computer algorithms for network reliability computation. For instance, Murray [1992] has developed a computer program that finds network cut sets and tie sets using different algorithms and various reduced cut-set and tie-set methods to compute reliability. A. M. Shooman [1991, 1992] has developed a computer program for network reliability that is based on the transformations of Section 6.5 and is modified for the k-terminal problem, in addition to other more advanced transformations. His model includes the possibility of node failure.


6.7 DESIGN APPROACHES 

The previous sections treated the problem of network analysis. This section, however, treats the problem of design. We assume that there is a group of N cities or sites represented by nodes that are to be connected by E edges to form the network and that the various links have different costs and reliabilities. Our job is to develop design procedures that yield a good network, that is, one with low cost and high reliability. (The reader is referred to the books by Colbourn [1987] and Shier [1991] for a comprehensive introduction to the literature in this field.)

The problem stated in the previous paragraph is actually called the topological design problem and is an abstraction of the complete design problem. The complete design problem includes many other considerations, such as the following:

1. The capacity in bits per second of each line is an important consideration, 
and connections between nodes can fail if the number of messages per 
minute is too large for the capacity of the line, if congestion ensues, and 
if a queue of waiting messages forms that causes unacceptable delays in 
transmission. 

2. If messages do not go through because of interrupted transmission paths
or excessive delays, information is fed back to various nodes. An algorithm,
called the routing algorithm, is stored at one or more nodes, and
alternate routes are generally invoked to send such messages via alternate
paths.

3. When edge or node failures occur, messages are rerouted, which may 
cause additional network congestion. 

4. Edges between nodes are based on communication lines (twisted-copper 
and fiber-optic lines as well as coaxial cables, satellite links, etc.) and 
represent a discrete (rather than continuous) choice of capacities. 

5. Sometimes, the entire network design problem is divided into a backbone 
network design (discussed previously) and a terminal concentrator design 
(for the connections within a building). 

6. Political considerations sometimes govern node placement. For example,
suppose that we wish to connect the cities New York and Baltimore and
that engineering considerations point to a route via Atlantic City. For
political considerations, however, it is likely that the route would instead
go through Philadelphia.

For more information on this subject, the reader is referred to Kershenbaum [1993]. The remainder of this chapter considers the topological design of a backbone network.

6.7.1 Introduction 

The general problem of network design is quite complex. If we start with n 
nodes to connect, we have the problem of determining which set of edges 
(arcs) is optimum. Optimality encompasses parameters such as best reliability, 
lowest cost, shortest delay, maximum bandwidth, and most flexibility for future 
expansion. The problem is too complex for direct optimization of any practical 
network, so subsets and/or approximate optimizations are used to reduce the 
general problem to a tractable one. 

At present, most network design focuses on minimizing cost with constraints on time delay and/or throughput. Semiquantitative reliability constraints are often included, such as that contained in the statement “the network should be at least two-connected,” that is, there should be no fewer than two paths between each node pair. This may not produce the best design when reliability is of great importance, and furthermore, a two-connected criterion may not yield a minimum cost design. This section approaches network design as a problem in maximization of network reliability within a specified cost budget.

High-capacity fiber-optic networks make it possible to satisfy network throughput constraints with significantly fewer lines than required with older technology. Thus such designs have less inherent redundancy and generally lower reliability. (A detailed study requires a comparison of the link reliabilities of conventional media versus fiber optics.) In such cases, reliability therefore must be the focus of the design process from the outset to ensure that adequate reliability goals are met at a reasonable cost.

Assume that network design is composed of three phases: (a) a backbone design, (b) a local access design, and (c) a local area network within the building (however, here we focus our attention on backbone design). Assume also that backbone design can be broken into two phases: $A_1$ and $A_2$. Phase $A_1$ is the choice of an initial connected configuration, whereas phase $A_2$ is augmentation with additional arcs to improve performance under constraint(s). Each new arc in general increases the reliability, cost, and throughput and may decrease the delay.

6.7.2 Design of a Backbone Network Spanning-Tree Phase 

We begin our discussion of design by focusing on phase $A_1$. A connected graph is one where all nodes are connected by branches. A connected graph with the smallest number of arcs is a spanning tree. In general, a complete network (all edges possible) with n nodes will have $n^{(n-2)}$ spanning trees, each with $e_s = (n - 1)$ edges (arcs). For example, for the four-node network of Fig. 6.1 there are $4^{(4-2)} = 16$ spanning trees with $(4 - 1) = 3$ arcs; these are shown in Fig. 6.7.

[Figure 6.7 The 16 spanning trees for the network of Fig. 6.1.]


The all-terminal reliability of a spanning tree is easy to compute since removal of any one edge disconnects at least one terminal pair. Thus, each edge is one of the $(n - 1)$ cut sets of the spanning tree, and all the $(n - 1)$ edges are the single tie set. All edges of the spanning tree must work for all the terminals to be connected, and the all-terminal reliability of a spanning tree with n nodes and $(n - 1)$ independent branches with probabilities $p_i$ is the probability that all branches are good.

$$R_{\text{all}} = \prod_{i=1}^{n-1} p_i \tag{6.57}$$

If all the branch reliabilities are independent and identical, $p_i = p$, and Eq. (6.57) becomes

$$R_{\text{all}} = p^{n-1} \tag{6.58}$$

[Figure 6.8 A graph of an early version of the ARPA network model (circa 1979).]


Thus, for identical branches, all the spanning trees have the same reliability. If the branch reliabilities, $p_i$, differ, then we can use the spanning tree with the highest reliability for the first-stage design of phase $A_1$. In the case of the spanning trees of Fig. 6.7, we can compute each of the reliabilities using Eq. (6.57) and select the most reliable one as the starting point. If there are a large number of nodes, then an exhaustive search for the most reliable tree is no longer feasible. For example, a graph of an early version of the Advanced Research Projects Agency (ARPA) network shown in Fig. 6.8 has 59 nodes. If we were to start over and ask what network we would obtain by using the design techniques under discussion, we would need to consider $59^{57} = 8.68 \times 10^{100}$ designs. Fortunately, Kruskal’s and Prim’s algorithms can be used for finding the spanning tree of a graph with the minimum weights. The use of these algorithms is discussed in the next section.
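The spanning-tree count and the size of the exhaustive search can be confirmed with a short script. The Python sketch below enumerates every choice of $n - 1$ edges from the four-node complete graph and keeps those that connect all nodes; the helper name `connects_all` and the edge list are illustrative assumptions, and identical branch reliabilities are used so that Eq. (6.58) applies.

```python
from itertools import combinations

def connects_all(nodes, chosen):
    """True if the chosen edges join every node into one component."""
    reached, frontier = {nodes[0]}, [nodes[0]]
    while frontier:
        n = frontier.pop()
        for u, v in chosen:
            other = v if u == n else u if v == n else None
            if other is not None and other not in reached:
                reached.add(other)
                frontier.append(other)
    return len(reached) == len(nodes)

nodes = ["a", "b", "c", "d"]
edges = [("a", "b"), ("b", "c"), ("c", "d"),
         ("a", "d"), ("a", "c"), ("b", "d")]
n = len(nodes)

# A set of n - 1 edges that connects all n nodes is a spanning tree.
trees = [t for t in combinations(edges, n - 1) if connects_all(nodes, t)]

p = 0.9
tree_reliability = p ** (n - 1)      # Eq. (6.58), the same for every tree
designs_59 = 59 ** 57                # spanning trees of a complete 59-node graph

print(len(trees), n ** (n - 2))
print(tree_reliability)
print(f"{designs_59:.2e}")
```

Enumeration is already wasteful at four nodes (20 candidate edge triples are tested to find the trees), and the last line shows why it is hopeless for a network of ARPA size.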

For most small and medium networks, the graph model shown in Fig. 6.1 is adequate; for large networks, however, the use of a computer is mandated, and for storing network topology in a computer program, matrix (array) techniques are most commonly used. Two types of matrices are commonly employed: adjacency matrices and incidence matrices [Dierker, 1986; Rosen, 1999]. In a network with n nodes and e edges, an adjacency matrix is an n x n matrix where the rows and columns are labeled with the node numbers. The entries are either a zero, indicating no connection between the nodes, or a one, indicating an arc between the nodes. An adjacency matrix for the graph of Fig. 6.6 is shown in Fig. 6.9(a). Note that the main diagonal is composed of all zeroes since self-loops are generally not present in communications applications, but they may be common in other applications of graphs. Also, the adjacency matrix is applicable to simple graphs that do not have multiple edges between nodes. (An adjacency matrix can be adapted to represent a graph with multiple edges between nodes if each entry represents a list of all the connecting edges. Also, if the adjacency matrix can be made nonsymmetrical and if entries of +1 and −1 for branches leaving and entering nodes can be introduced, then the adjacency matrix can represent a directed graph.) The sum of all the entries in a row of an adjacency matrix is the degree of the node associated with the row. This degree is the number of edges that are incident on the node.

(a) Adjacency Matrix

                     Nodes
               a    b    c    a'   b'
          a    0    1    1    0    0
          b    1    0    1    1    1
Nodes     c    1    1    0    0    0
          a'   0    1    0    0    1
          b'   0    1    0    1    0

(b) Incidence Matrix

                     Edges
               1    2    3    4    5    6
          a    0    1    1    0    0    0
          b    1    1    0    1    1    0
Nodes     c    1    0    1    0    0    0
          a'   0    0    0    0    1    1
          b'   0    0    0    1    0    1

Figure 6.9 Adjacency and incidence matrices for the example of Fig. 6.6.

One can also represent a graph by an incidence matrix that has n rows and e columns. A zero in any location indicates that the edge is not incident on the associated node, whereas a one indicates that the edge is incident on the node. (Multiple edges and self-loops can be represented by adding columns for these additional edges, and directed edges can be represented by +1 and −1 entries.) An incidence matrix for the graph of Fig. 6.6 is shown in Fig. 6.9(b).
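Both matrices can be generated from a single edge list, which is usually how a program would store the topology in the first place. The Python sketch below builds them for the five-node graph of Fig. 6.6; the edge-to-node assignments are read off the incidence matrix of Fig. 6.9(b), and the variable names are illustrative assumptions.

```python
nodes = ["a", "b", "c", "a'", "b'"]
edges = {                      # edge number -> the two nodes it joins
    1: ("b", "c"), 2: ("a", "b"), 3: ("a", "c"),
    4: ("b", "b'"), 5: ("b", "a'"), 6: ("a'", "b'"),
}
index = {name: i for i, name in enumerate(nodes)}

adjacency = [[0] * len(nodes) for _ in nodes]    # n x n
incidence = [[0] * len(edges) for _ in nodes]    # n x e

for column, (number, (u, v)) in enumerate(sorted(edges.items())):
    adjacency[index[u]][index[v]] = 1            # undirected: mark both
    adjacency[index[v]][index[u]] = 1            # (u, v) and (v, u)
    incidence[index[u]][column] = 1              # the edge touches u
    incidence[index[v]][column] = 1              # and touches v

# The row sum of the adjacency matrix is the degree of the node.
degrees = {name: sum(adjacency[index[name]]) for name in nodes}

for name, row in zip(nodes, adjacency):
    print(f"{name:3}", row)
for name, row in zip(nodes, incidence):
    print(f"{name:3}", row)
print(degrees)
```

For a simple graph such as this one, each column of the incidence matrix contains exactly two ones, and each row sum equals the corresponding row sum of the adjacency matrix.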
6.7.3 Use of Prim’s and Kruskal’s Algorithms

A weighted graph is one in which one or more parameters are associated with
each edge of the graph. In a network, the associated parameters are commonly
cost, reliability, length, capacity, and delay. A common problem is to find a
minimum spanning tree, that is, a spanning tree with the smallest possible
sum of the weights of the edges. Either Prim’s or Kruskal’s algorithm can be
used to find a minimum spanning tree [Cormen, 1992; Dierker, 1986;
Kershenbaum, 1993; and Rosen, 1999]. Both are “greedy” algorithms: the
locally best choice is made at each step of the procedure.

In Section 6.7.2, we discussed the use of a spanning tree as the first phase
of design. An obvious choice is the use of a minimum-cost spanning tree for
phase $A_1$, but in most cases an exhaustive search of the possible spanning
trees is impractical and either Kruskal’s or Prim’s algorithm can be used.
Similarly, if the highest-reliability spanning tree is desired for phase
$A_1$, these same algorithms can be used. Examination of Eq. (6.57) shows
that the highest reliability is obtained if we select the spanning tree whose
$p_i$ have a maximum sum, since this tree also has the maximum product. [This
is true because the log of Eq. (6.57) is the sum of the logs of the
individual probabilities, the maximum of the log of a function occurs at the
same point as the maximum of the function, and the greedy algorithms depend
only on the ordering of the edge weights, which is the same for $p_i$ and
$\log p_i$.] Prim’s and Kruskal’s algorithms can also be used to find a
maximum spanning tree, that is, a spanning tree with the largest possible sum
of the weights of the edges. Thus they will find the highest-reliability
spanning tree. Another approach is to find the minimum spanning tree with the
edge failure probabilities $1 - p_i$ as weights; this orders the edges in the
same way and therefore also maximizes the reliability.
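
The logarithm argument can be written out explicitly. The lines below are a sketch that assumes Eq. (6.57) has the product form the text implies, with $T$ denoting the set of edges in a candidate spanning tree (the symbols $T$ and $w_i$ are introduced here for illustration only).

```latex
R_T = \prod_{i \in T} p_i
\quad\Longrightarrow\quad
\ln R_T = \sum_{i \in T} \ln p_i ,
\qquad
\arg\max_{T} R_T = \arg\max_{T} \sum_{i \in T} \ln p_i
                 = \arg\min_{T} \sum_{i \in T} w_i ,
\quad w_i = -\ln p_i \ge 0 .
```

With the weights $w_i = -\ln p_i$, the highest-reliability tree is an ordinary minimum spanning tree. Because $-\ln p_i$ decreases as $p_i$ increases, sorting the edges by decreasing $p_i$ gives the same order as sorting by increasing $w_i$, which is all that a greedy spanning tree algorithm uses.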

A simple statement of Prim’s algorithm is to select an initial edge of
minimum (maximum) weight, by search or by first ordering all edges by weight,
and to use this edge to begin the tree. Each subsequent edge is the minimum
(maximum) weight edge among those that are connected to a node of the tree
and do not form a circuit (loop). The process is continued until
(n - 1) edges have been selected. For a description of Kruskal’s algorithm (a
similar procedure) and the choice between Kruskal’s and Prim’s algorithms,
the reader is directed to the references [Cormen, 1992; Dierker, 1986;
Kershenbaum, 1993; and Rosen, 1999]. The use of Prim’s algorithm in design is
illustrated by the following problem.
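
A minimal sketch of Prim’s algorithm is given below. It uses the common variant that starts from an arbitrary node and keeps the candidate edges in a heap, rather than starting from the lightest edge of a presorted list as in the statement above. The function names, the `maximize` flag, and the four-node example graph are assumptions made for illustration.

```python
import heapq

def prim(nodes, weights, start=None, maximize=False):
    """Return the tree edges chosen by Prim's algorithm.

    weights maps an undirected edge (u, v) to its weight. With
    maximize=True the weights are negated, so the same greedy step
    grows a maximum spanning tree.
    """
    sign = -1 if maximize else 1
    neighbours = {v: [] for v in nodes}
    for (u, v), w in weights.items():
        neighbours[u].append((sign * w, u, v))
        neighbours[v].append((sign * w, v, u))

    start = nodes[0] if start is None else start
    in_tree = {start}
    frontier = list(neighbours[start])   # edges leaving the current tree
    heapq.heapify(frontier)
    tree = []
    while frontier and len(tree) < len(nodes) - 1:
        _, u, v = heapq.heappop(frontier)
        if v in in_tree:                 # would close a circuit, so discard it
            continue
        in_tree.add(v)
        tree.append((u, v))
        for item in neighbours[v]:
            if item[2] not in in_tree:
                heapq.heappush(frontier, item)
    return tree

def tree_weight(tree, weights):
    """Sum the weights of the tree edges, whichever way each edge is stored."""
    return sum(weights[e] if e in weights else weights[(e[1], e[0])] for e in tree)

# Hypothetical four-node example.
weights = {("A", "B"): 1, ("B", "C"): 2, ("A", "C"): 3, ("C", "D"): 4, ("B", "D"): 5}
min_tree = prim(list("ABCD"), weights)
max_tree = prim(list("ABCD"), weights, maximize=True)
print(min_tree, tree_weight(min_tree, weights))
print(max_tree, tree_weight(max_tree, weights))
```

The loop stops once n - 1 edges have been selected, matching the stopping rule in the text.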

We wish to design a network that connects six cities represented by the
graph nodes of Fig. 6.10(a). The edge reliabilities and edge costs are given
in Fig. 6.10(b) and (c), which are essentially weighted adjacency matrices in
which the entries on and to the left of the main diagonal are omitted for
clarity because the matrices are symmetrical about the main diagonal.

                 (2)
          (1)           (3)
                 (4)
          (5)           (6)

(a) Network Nodes

           1      2      3      4      5      6
    1            0.94   0.91   0.96   0.93   0.92
    2                   0.94   0.97   0.91   0.92
    3                          0.94   0.90   0.94
    4                                 0.93   0.96
    5                                        0.91

(b) Edge Reliabilities

           1      2      3      4      5      6
    1            10     25     10     20     30
    2                   10     10     25     20
    3                          20     40     10
    4                                 20     10
    5                                        30

(c) Edge Costs

Figure 6.10 A network design example.

We begin our design by finding the lowest-cost spanning tree using Prim’s
algorithm. To start, we order the edges by cost as shown in the first two
columns of Table 6.5. The algorithm proceeds as follows until 5 edges forming
a tree are selected from the 15 possible edges to form the lowest-cost
spanning tree: select 1-2; select 1-4; select 2-3; we cannot select 2-4 since
it forms a loop (delete selection); select 3-6; we cannot select 4-6 since it
forms a loop (delete selection); and, finally, select 1-5 to complete the
spanning tree. The sequence of selections and deletions is shown in the last
two columns of Table 6.5. Note that the remaining 8 edges, 2-6 through 3-5,
are not considered. One could have chosen edge 4-5 instead of 1-5 for the
last step and achieved a different tree with the same cost. The total cost of
this minimum-cost tree is 10 + 10 + 10 + 10 + 20 = 60 units. The reliability
of this network can be easily calculated as the product of the edge
reliabilities: 0.94 x 0.96 x 0.94 x 0.94 x 0.93 = 0.7415. The resulting
network is shown in Fig. 6.11(a).
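
The selection sequence can be checked with a short script. The sketch below is an illustration rather than the author’s program: it scans the edge list in the order of Table 6.5 and, at each step, takes the first remaining edge that touches the tree, deleting it if both ends are already in the tree. The function name `prim_sorted_scan` is an assumption; the data are those of Fig. 6.10.

```python
import math

# Edges in the cost order of Table 6.5 (costs from Fig. 6.10(c)).
ordered_edges = [
    ((1, 2), 10), ((1, 4), 10), ((2, 3), 10), ((2, 4), 10), ((3, 6), 10),
    ((4, 6), 10), ((1, 5), 20), ((2, 6), 20), ((3, 4), 20), ((4, 5), 20),
    ((1, 3), 25), ((2, 5), 25), ((1, 6), 30), ((5, 6), 30), ((3, 5), 40),
]
# Edge reliabilities from Fig. 6.10(b).
reliability = {
    (1, 2): 0.94, (1, 3): 0.91, (1, 4): 0.96, (1, 5): 0.93, (1, 6): 0.92,
    (2, 3): 0.94, (2, 4): 0.97, (2, 5): 0.91, (2, 6): 0.92, (3, 4): 0.94,
    (3, 5): 0.90, (3, 6): 0.94, (4, 5): 0.93, (4, 6): 0.96, (5, 6): 0.91,
}

def prim_sorted_scan(edges, n_nodes):
    """Grow a tree from the first edge of an already sorted edge list."""
    remaining = [edge for edge, _ in edges]
    first = remaining.pop(0)
    in_tree = set(first)
    selected, deleted = [first], []
    while len(selected) < n_nodes - 1:
        for edge in remaining:
            u, v = edge
            if u in in_tree and v in in_tree:
                deleted.append(edge)      # would form a loop
                break
            if u in in_tree or v in in_tree:
                selected.append(edge)     # extends the tree by one node
                in_tree.update(edge)
                break
        else:
            raise ValueError("no edge touches the tree: graph not connected")
        remaining.remove(edge)
    return selected, deleted

cost = dict(ordered_edges)
selected, deleted = prim_sorted_scan(ordered_edges, n_nodes=6)
total_cost = sum(cost[e] for e in selected)
tree_reliability = math.prod(reliability[e] for e in selected)
print(selected, deleted, total_cost, round(tree_reliability, 4))
```

The script selects 1-2, 1-4, 2-3, 3-6, and 1-5 and deletes 2-4 and 4-6, for a total cost of 60 and a reliability of about 0.7415.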

TABLE 6.5 Prim’s Algorithm for Minimum Cost^a

    Edge    Cost    Selected Step No.    Deleted Step No.
    1-2     10      1                    -
    1-4     10      2                    -
    2-3     10      3                    -
    2-4     10      -                    4
    3-6     10      5                    -
    4-6     10      -                    6
    1-5     20      7                    -
    2-6     20
    3-4     20
    4-5     20
    1-3     25
    2-5     25
    1-6     30
    5-6     30
    3-5     40

^a See Fig. 6.10.

(a) Minimum Cost Spanning Tree (edges 1-2, 1-4, 2-3, 3-6, 1-5):
    Cost = 60, Reliability = 0.7415
(b) Maximum Reliability Spanning Tree (edges 2-4, 1-4, 4-6, 2-3, 1-5):
    Cost = 60, Reliability = 0.7814

Figure 6.11 Two spanning tree designs for the example of Fig. 6.10.

Now, we repeat the design procedure by calculating a maximum-reliability
tree using Prim’s algorithm. The result is shown in Table 6.6 and the
resultant tree is in Fig. 6.11(b). The reliability optimization has produced
a superior design with the same cost but a higher reliability. In general,
the two procedures will produce different designs, one with a lower cost and
a lower reliability and the other with a higher reliability and a higher
cost. In such cases, engineering trade-offs must be performed for design
selection. Perhaps a different solution with less-than-optimum cost and
reliability would be the best solution. Since the design procedure for the
spanning tree phase is not too difficult, a group of spanning trees can be
computed and carried forward to the enhancement phase, and a design trade-off
can be performed on the final designs.

TABLE 6.6 Prim’s Algorithm for Maximum Reliability^a

    Edge    Reliability    Selected Step No.    Deleted Step No.
    2-4     0.97           1                    -
    1-4     0.96           2                    -
    4-6     0.96           3                    -
    1-2     0.94           -                    4
    2-3     0.94           5                    -
    3-4     0.94           -                    6
    3-6     0.94           -                    7
    1-5     0.93           8                    -
    4-5     0.93
    1-6     0.92
    2-6     0.92
    1-3     0.91
    2-5     0.91
    5-6     0.91
    3-5     0.90

^a See Fig. 6.10.

Since both cost and reliability matter in the optimization, and also since 
there are many ties, we can return to Table 6.5 and re-sort the edges by cost 
and wherever ties occur by reliability. The result is Table 6.7, which in this 
case yields the same design as Table 6.6. 
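
The two-key ordering is easy to reproduce. The sketch below sorts the edges of Fig. 6.10 by cost and breaks ties by decreasing reliability, then builds the tree with Kruskal’s algorithm (a union-find implementation, used here instead of Prim’s as an illustration of the similar procedure mentioned earlier). For this data it makes the same selections as Table 6.7. The function and variable names are assumptions for this example.

```python
import math

# Edge costs and reliabilities from Fig. 6.10(c) and (b).
cost = {
    (1, 2): 10, (1, 3): 25, (1, 4): 10, (1, 5): 20, (1, 6): 30,
    (2, 3): 10, (2, 4): 10, (2, 5): 25, (2, 6): 20, (3, 4): 20,
    (3, 5): 40, (3, 6): 10, (4, 5): 20, (4, 6): 10, (5, 6): 30,
}
reliability = {
    (1, 2): 0.94, (1, 3): 0.91, (1, 4): 0.96, (1, 5): 0.93, (1, 6): 0.92,
    (2, 3): 0.94, (2, 4): 0.97, (2, 5): 0.91, (2, 6): 0.92, (3, 4): 0.94,
    (3, 5): 0.90, (3, 6): 0.94, (4, 5): 0.93, (4, 6): 0.96, (5, 6): 0.91,
}

# Primary key: cost (ascending). Tie-break: reliability (descending).
# Python's sort is stable, so full ties keep their listing order.
ordered = sorted(cost, key=lambda e: (cost[e], -reliability[e]))

def kruskal(ordered_edges, nodes):
    """Take each edge in order unless it joins two nodes already connected."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]   # path halving
            v = parent[v]
        return v

    tree = []
    for u, v in ordered_edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # no loop is formed
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

tree = kruskal(ordered, range(1, 7))
tree_cost = sum(cost[e] for e in tree)
tree_reliability = math.prod(reliability[e] for e in tree)
print(ordered[:8])
print(tree, tree_cost, tree_reliability)
```

The resulting tree costs 60 units and has a reliability of about 0.7815 (quoted in the text as the truncated value 0.7814).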

The decision to use a spanning tree for the first stage of design is primarily
based on the existence of a simple algorithm (Prim’s or Kruskal’s) to obtain
phase $A_1$ of the design. Other procedures, however, could be used. For
example, one could begin with a Hamiltonian circuit (tour), a network
containing N edges and one circuit that passes through each node only once. A
Hamiltonian circuit has one more edge than a spanning tree. (The reader is
referred to the problems at the end of this chapter for a consideration of
Hamiltonian tours for phase $A_1$ of the design.) Hamiltonian tours do not
exist for all networks [Dierker, 1986; Rosen, 1999], but if we consider that
all edges are potentially possible, that is, a complete graph, Hamiltonian
tours will exist [Frank, 1971].

TABLE 6.7 Prim’s Algorithm for Minimum Cost, Edges First Sorted by Cost
and Then by Reliability^a

    Edge    Cost    Reliability    Selected Step No.    Deleted Step No.
    2-4     10      0.97           1                    -
    1-4     10      0.96           2                    -
    4-6     10      0.96           3                    -
    1-2     10      0.94           -                    4
    2-3     10      0.94           5                    -
    3-6     10      0.94           -                    6
    3-4     20      0.94           -                    7
    1-5     20      0.93           8                    -
    4-5     20      0.93
    2-6     20      0.92
    1-3     25      0.91
    2-5     25      0.91
    1-6     30      0.92
    5-6     30      0.91
    3-5     40      0.90

^a See Fig. 6.10.


6.7.4 Design of a Backbone Network: The Enhancement Phase 

The first phase of design, $A_1$, will probably produce a connected network
with a smaller all-terminal reliability than required. To improve the
reliability, new branches are added. Unfortunately, the effect on the network
all-terminal reliability is now a function of not only the reliability of the
added branch but also of its location in the network. Thus we must evaluate
the network reliability for each of the proposed choices and pick the most
cost-effective solution. The use of network reliability calculation programs
greatly aids such a trial-and-error procedure [Murray, 1992; A. M. Shooman,
1992]. There are also various network design programs that incorporate
reliability and other calculations to arrive at a network design
[Kershenbaum, 1993].

In general, one designs a network with only a fraction of all the edges that
would be available in a complete graph, $e_c$, which is given by

$$e_c = \frac{n(n - 1)}{2} \tag{6.59}$$

If phase $A_1$ of the design is a spanning tree with $e_t = n - 1$ edges,
there are $e_r$ remaining edges given by

$$e_r = e_c - e_t = \frac{n(n - 1)}{2} - (n - 1) = \frac{(n - 1)(n - 2)}{2} \tag{6.60}$$

For the example given in Fig. 6.10, we have 6 nodes, 15 possible edges in
the complete graph, 5 edges in the spanning tree, and 10 possible additional
arcs that can be assigned during the enhancement phase. We shall experiment
with a few examples of enhancement and leave further discussion for the
examples at the end of this chapter. Two attractive choices for enhancement
are the cost = 10 edges not used in Table 6.7: edge 1-2 and edge 3-6. A
simplifying technique will be used to evaluate the reliability of the enhanced
network. We let $R_{A1}$ be the reliability of the network created by phase
$A_1$ and let X represent the success of the added branch (X' is the event
failure of the added branch). Using the decomposition (edge-factoring) theorem
given in Fig. 6.5(c), we can write the reliability of the enhanced network
$R_{A2}$ as

$$R_{A2} = P(X)\,P(\text{Network} \mid X) + P(X')\,P(\text{Network} \mid X') \tag{6.61}$$

The term P(Network | X') is the reliability of the network with added edge X
failed (open-circuited), which is $R_{A1}$, that is, what it was before we
added the enhancement edge. If P(X) is denoted by $p_x$, then
P(X') = 1 - $p_x$. Lastly, P(Network | X) is the reliability of the network
with X good, that is, with both nodes for edge X merged (one could say that
edge X was “shorted,” which simplifies the computation). Thus Eq. (6.61)
becomes

$$R_{A2} = p_x\,P(\text{Network} \mid X \text{ shorted}) + (1 - p_x)\,R_{A1} \tag{6.62}$$

Evaluation of the X shorted term for the addition of edge 1-2 or 3-6 is given
in Fig. 6.12. Note that in Fig. 6.12(a), there are parallel branches between
(1 = 2) and 4, which are reduced in parallel. Similarly, in Fig. 6.12(b),
edges 4-2 and 2-(6 = 3) are reduced in series and then in parallel with
4-(6 = 3). Note that in the computations given in Fig. 6.12(b), the series
transformation and the overall computation must be “conditioned” by the terms
shown in { } [see Fig. 6.5(a) and Eqs. (6.53)-(6.55)]. Substitution into
Eq. (6.62) yields

$$R_{A2} = (0.94)[0.838224922] + (0.06)[0.7814] = 0.8348 \quad \text{(for addition of edge 1-2)} \tag{6.63}$$

$$R_{A2} = (0.94)[0.8881] + (0.06)[0.7814] = 0.8817 \quad \text{(for addition of edge 3-6)} \tag{6.64}$$

The addition of edge 3-6 seems to be more cost-effective than the addition of
edge 1-2: both edges cost 10 units [Fig. 6.10(c)], but edge 3-6 gives the
larger gain in reliability. In general, the objective of the various design
methods is to choose branches that optimize the various parameters of the
computer network. The reader should consult the references for further
details.
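
The arithmetic of Eqs. (6.62) through (6.64) can be checked numerically. The sketch below follows the reductions of Fig. 6.12 step by step; the function and variable names are assumptions for this example, and $R_{A1}$ is taken as the value 0.7814 quoted in the text.

```python
def enhanced_reliability(p_x, r_shorted, r_before):
    """Eq. (6.62): R_A2 = p_x * P(Network | X shorted) + (1 - p_x) * R_A1."""
    return p_x * r_shorted + (1.0 - p_x) * r_before

r_a1 = 0.7814   # maximum-reliability tree of Fig. 6.11(b), as quoted in the text

# Fig. 6.12(a): edge 1-2 shorted, so edges 1-4 (0.96) and 2-4 (0.97) are in
# parallel; edges 1-5, 2-3, and 4-6 remain in series.
shorted_12 = 0.93 * 0.94 * 0.96 * (1 - 0.04 * 0.03)

# Fig. 6.12(b): edge 3-6 shorted. Series path 4-2-(6=3), conditioned by { },
# then in parallel with 4-(6=3), then multiplied by the conditioning term.
series_426 = (0.97 * 0.94) / (1 - 0.03 * 0.06)
parallel = 1 - (1 - series_426) * (1 - 0.96)
shorted_36 = 0.93 * 0.96 * parallel * (0.97 + 0.94 - 0.97 * 0.94)

r_12 = enhanced_reliability(0.94, shorted_12, r_a1)   # add edge 1-2 (p = 0.94)
r_36 = enhanced_reliability(0.94, shorted_36, r_a1)   # add edge 3-6 (p = 0.94)

edge_cost = 10  # both candidate edges cost 10 units in Fig. 6.10(c)
for name, r in (("1-2", r_12), ("3-6", r_36)):
    gain = r - r_a1
    print(name, round(r, 4), "gain per unit cost:", round(gain / edge_cost, 5))
```

Both candidate edges have the same cost, so the comparison reduces to the reliability gain, which is roughly 0.053 for edge 1-2 and 0.100 for edge 3-6.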


(a) Addition of edge 1-2

    R = (0.93)(0.94)(0.96)[1 - 0.04 x 0.03]
    R = (0.93)(0.94)(0.96)[0.9988]
    R = 0.838224922

(b) Addition of edge 3-6

    $R_{426\text{-}3}$ = (0.97)(0.94)/{1 - 0.03 x 0.06} = 0.9134
    $R_{426\text{-}3} \,\|\, R_{46\text{-}3}$ = 1 - [(1 - 0.9134) x (1 - 0.96)] = 0.9965
    P(Network | X short) = (0.93)(0.96)(0.9965) x (0.97 + 0.94 - 0.97 x 0.94)
                         = 0.8881

Figure 6.12 Evaluation of the X shorted term.


6.7.5 Other Design Approaches

Although network design is a complex, difficult problem, substantial work has
been done in this field over the years. A number of algorithms in addition to
Prim’s and Kruskal’s are used for design approaches (some of these algorithms
are listed in Table 6.8). The reader may consult the following references for
further details: Ahuja [1982]; Colbourn [1987, 1993, 1995]; Cormen [1992];
Frank [1971]; Kershenbaum [1993]; and Tenenbaum [1981, 1996].

The various design approaches presently in use are described in Ahuja
[1982] and Kershenbaum [1993]. A topological design algorithm usually
consists of two parts: the first initializes the topology and the second
iteratively improves it (we have used the terms connected topology and
augmentation to improve performance to refer to the first and second parts).
Design algorithms begin by obtaining a feasible network topology.
Augmentation adds further edges to the design to improve such design
parameters as network delay, cost, reliability, and capacity. The optimizing
step is repeated until one of the constraints is exceeded. A number of
algorithms and heuristics can be applied. Most procedures start with a
minimum spanning tree, but because the spanning tree algorithms do not
account for link-capacity constraints, they are often called unconstrained
solutions. The algorithms then modify the starting topology until it
satisfies the constraints.

The algorithms described in Ahuja’s book [1982] include Kruskal’s,
Chandy-Russell’s, and Esau-Williams’s, in addition to Kershenbaum-Chou’s
generalized heuristic. Kershenbaum’s book [1993] discusses these same
algorithms, but, in addition, it describes Sharma’s algorithm, commenting
that this algorithm is widely used in practice. The basic procedure for the
Esau-Williams algorithm is to use a “greedy” type of algorithm to construct a
minimum spanning tree. One problem that arises is that nodes may connect to
the center of one component (group) of nodes and leave nodes that are stranded
far from a center (a costly connection). The Esau-Williams algorithm aids in
eliminating this problem by implementing a trade-off between the connection
to a center and the interconnection between two components.

TABLE 6.8 Various Network Design Algorithms

Tree-Traversal Algorithms

(1) Breadth-first search (BFS)

(2) Depth-first search (DFS)

(3) Connected-components algorithm

(4) Minimum spanning tree:

    (a) “greedy” algorithm

    (b) Kruskal’s algorithm

    (c) Prim’s algorithm

Shortest-Path Algorithms

(1) Dijkstra’s algorithm

(2) Bellman’s algorithm

(3) Floyd’s algorithm

(4) Incremental-shortest-path algorithms

Single-Commodity-Network Flows

(1) Ford-Fulkerson algorithm

(2) Minimum-cost flows

Kershenbaum [1993] discusses other algorithms (Gerla’s, Frank’s, Chou’s,
Eckl’s, and Maruyama’s) that he calls branch-exchange algorithms. The design
starts with a feasible topology and then locally modifies it by adding or
dropping links; alternatively, however, the design may start with a complete
graph and identify links to drop. One can decide which links to drop by
finding the flow in each link, computing the cost-to-flow ratio, and removing
the link with the largest ratio. If an improvement is obtained, the exchange
is accepted; if not, another is tried. Kershenbaum speaks of evaluation in
terms of cost and/or delay but not reliability. Thus the explicit addition of
reliability to the design procedure results in different solutions. In all
cases, these will emphasize reliability, and in some cases they will produce
an improved solution.

PROBLEMS

6.1. Consider a computer network connecting the cities of Boston, Hartford,
New York, Philadelphia, Pittsburgh, Baltimore, and Washington. (See
Fig. P1.)


O Boston 
O Hartford 

O New York 

O Pittsburgh O Philadelphia 


O Baltimore 
O Washington 

Figure P1

(a) What is the minimum number of lines to connect all the cities? 

(b) What is the best tree to pick? 

6.2. You are to evaluate the reliability of the network shown in Fig. P2. Edge
reliabilities are all 0.9 and independent.

(a) Find the two-terminal reliability, $R_{ac}$:

(1) using state-space techniques,

(2) using tie-set techniques,

(3) using transformations.

(b) Find the all-terminal reliability, $R_{\text{all}}$:

(1) using state-space techniques, 

(2) using cut-set approximations, 

(3) using transformations. 

6.3. The four-node network shown in Fig. P3 is assumed to be a complete 
graph (all possible edges are present). Assume that all edge reliabilities 
are 0.9. 


o o 

A B 

o o 

C D 

Figure P3 

(a) By enumeration, find all the trees you can define for this network. 

(b) By enumeration, find all the Hamiltonian tours you can find for this 
network. 

(c) Compute the all-terminal reliabilities for the tree networks. 

(d) Compute the all-terminal reliabilities for the Hamiltonian networks. 

(e) Compute the two-terminal reliabilities $R_{AB}$ for the trees.

(f) Compute the two-terminal reliabilities $R_{AB}$ for the Hamiltonian tours.

6.4. Assume that the edge reliabilities for the network given in problem 6.3
are R(AB) = 0.95; R(AC) = 0.92; R(AD) = 0.90; R(BC) = 0.90; R(BD) =
0.80; and R(CD) = 0.80.

(a) Repeat problem 6.3(c) for the preceding edge reliabilities.

(b) Repeat problem 6.3(d) for the preceding edge reliabilities.

(c) Repeat problem 6.3(e) for the preceding edge reliabilities.

(d) Repeat problem 6.3(f) for the preceding edge reliabilities.

(e) If the edge costs are C(AB) = C(AC) = 10; C(AD) = C(BC) = 8; and
C(BD) = C(CD) = 7, which of the networks is most cost-effective?

6.5. Prove that the number of edges in a complete, simple N node graph is 
given by N(N - l)/2. (Hint: Use induction or another approach.) 

6.6.

Figure P4

For the network shown in Fig. P4, find

(a) the adjacency matrix, 

(b) the incidence matrix. 

6.7. A simple algorithm for finding the one-link and two-link (one-hop and 
two-hop) paths between any two nodes is based on the properties of the 
incidence matrix. 

(a) If a one-hop path (one-link tie set) exists between any two nodes,
$n_1$ and $n_2$, there will be ones in the $n_1$ and $n_2$ rows for the
column corresponding to the one-link tie set.

(b) If a two-hop tie set exists between any two nodes, it can be found
by taking the sum of the matrix entries for the two columns. In the
summed column, the numeral 1 appears in the rows for the connected
nodes and 2 in the row for the intermediate node.

(c) Write a “pseudocode” algorithm for this approach to finding one- 
and two-hop tie sets. 

(d) Does this generalize to n hops? Explain the difficulties. 

6.8. Can the algorithm of problem 6.7 be adapted to a directed graph? (See
Shooman [1990, p. 609].)

6.9. What is the order of the algorithm of problem 6.7?

6.10. Write a simple program in any language to implement the algorithm of 
problem 6.9 for up to six nodes. 

6.11. Compute the two-terminal reliability R a j for the network of problem 6.6 
by using the program of problem 6.10. 

(a) Assume that all links have a success probability of 0.9. 

(b) Assume that links 1, 2, and 3 have a success probability of 0.95 and 
that links 4 and 5 have a success probability of 0.85. 

6.12. Check the reliability of problem 6.11 by using any analytical technique. 

6.13. This homework problem refers to the network described in Fig. 6.10 in
the text. All links are potentially possible, and the matrices define the
link costs and reliabilities. Assume the network in question is composed
of links 1-2; 2-3; 3-6; 6-5; 5-1; 1-4; and 4-6. Compute the two-terminal
reliabilities $R_{26}$ and Rn by using the following methods:

(a) state-space, 

(b) tie-set, 

(c) cut-set, 

(d) keystone-component, 

(e) edge-transformation. 

6.14. Assume that you are doing a system design for a reliable, low-cost
network with the same geometry, potential arc costs, and reliabilities as
those given in problem 6.13. Compare the different designs obtained by 
plotting reliability versus cost, where we assume that the maximum cost 
budget is 100. Use the following approaches to network design: 

(a) Start with a minimum-cost tree and add new minimum-cost edges. 

(b) Start with a minimum-cost tree and add new highest-reliability 
edges. 

(c) Start with a highest-reliability tree and add new highest-reliability 
edges. 

(d) Start with a highest-reliability Hamiltonian tour and add new
highest-reliability edges. 

(e) Start with a minimum-cost Hamiltonian tour and add new highest- 
reliability edges. 

6.15. For a Hamiltonian tour in which the n branch probabilities are
independent and equal, show that the all-terminal reliability is given by the
binomial distribution:

$$R_{\text{all}} = P(\text{no branch failure} + \text{one branch failure}) = p^n + n p^{n-1}(1 - p)$$

6.16. Repeat problem 6.15 for a Hamiltonian tour in which the branch
probabilities are all different. Instead of the binomial distribution, you
will have to write out a series of terms that yields a formula different from
that of problem 6.15, but one that also reduces to the same formula for
equal probabilities.

6.17. Clearly, a Hamiltonian network of n identical elements has a higher
reliability than that of a spanning tree for n nodes. Show that the
improvement ratio is [p + n(1 - p)].

6.18. In this problem, we make a rough analysis to explore how the capacity
of communication lines affects a communication network. 

(a) Consider the nodes of problem 6.1 and connect them in a
Hamiltonian tour network in Table P1: Boston-Hartford-New
York-Philadelphia-Pittsburgh-Baltimore-Washington-Boston. Assume
that the Hartford-Baltimore-Pittsburgh traffic is small (2 units
each way) compared to the Boston-New York-Philadelphia-Washington
traffic (10 units each way). Fill in Table P1 with total traffic
units, assuming that messages are always sent (routed) via the
shortest possible way. Assume full duplex (simultaneous communication
in both directions).

(b) Assume a break on the Philadelphia-Pittsburgh line. Some messages
must now be rerouted over different, longer paths because of this
break. For example, the Washington-Boston line must handle all of
the traffic between Philadelphia, New York, Boston, and Washington.
Recompute the traffic table (Table P2).

RELIABILITY OPTIMIZATION


7.1 INTRODUCTION 

The preceding chapters of this book discussed a wide range of different
techniques for enhancing system or device fault tolerance. In some applications,
only one of these techniques is practical, and there is little choice among the 
methods. However, in a fair number of applications, two or more techniques 
are feasible, and the question arises regarding which technique is the most 
cost-effective. To address this problem, if one is given two alternatives, one 
can always use one technique for design A and use the other technique for 
design B. One can then analyze both designs A and B to study the trade-offs. 
In the case of a standby or repairable system, if redundancy is employed at a 
component level, there are many choices based on the number of spares and 
which component will be spared. At the top level, many systems appear as a 
series string of elements, and the question arises of how we are to distribute 
the redundancy in a cost-effective manner among the series string. Specifically, 
we assume that the number of redundant elements that can be added is limited 
by cost, weight, volume, or some similar constraint. The object is to determine 
the set of redundant components that still meets the constraint and raises the 
reliability by the largest amount. Some authors refer to this as redundancy
optimization [Barlow, 1965]. Two practical works, Fragola [1973] and Mancino
[1986], are given in the references; they illustrate the design of a system
with a high degree of parallel components. The reader should consult these
papers after studying the material in this chapter.

In some ways, this chapter can be considered an extension of the material
in Chapter 4. However, in this chapter we discuss the optimization approach,
where rather than having the redundancy apply to a single element, it is
distributed over the entire system in such a way that it optimizes
reliability. The optimization approach has been studied in the past, but is
infrequently used in practice for many reasons, such as (a) the system
designer does not understand the older techniques and the resulting
mathematical formulation; (b) the solution takes too long; (c) the parameters
are not well known; and (d) constraints change rapidly and invalidate the
previous solution. We propose a technique that is clear, simple to explain,
and results in the rapid calculation of a family of good suboptimal solutions
along with the optimal solution. The designer is then free to choose among
this family of solutions, and if the design features or parameters change,
the calculations can be repeated with modest effort.

We now postulate that the design of fault-tolerant systems can be divided
into three classes. In the first class, only one design approach (e.g.,
parallel, standby, voting) is possible, or intuition and experience point
only to a single approach. Thus it is simple to decide on the level of
redundancy required to meet the design goal or the level allowed by the
constraint. To simplify our discussion, we will refer to cost, but we must
keep in mind that all the techniques to be discussed can be adapted to any
other single constraint or, in many cases, multiple constraints. Typical
multiple constraints are cost, reliability, volume, and weight. Sometimes,
the optimum solution will not satisfy the reliability goal; then, either the
cost constraint must be increased or the reliability goal must be lowered. If
there are two or three alternative designs, we would merely repeat the
optimization for each alternative as discussed previously and choose the best
result. The second class is one in which there are many alternatives within
the design approach because we can apply redundancy at the subsystem level to
many subsystems. The third class, where a mixed strategy is being considered,
also has many combinations. To deal with the complexity of the third-class
designs, we will use computer computations and an optimization approach to
guide us in choosing the best alternative or set of alternatives.


7.2 OPTIMUM VERSUS GOOD SOLUTIONS 

Because of practical considerations, an approximate optimization yielding a
good system is favored over an exact one yielding the best solution. The
parameters of the solution, as well as the failure rates, weight, volume, and
cost, are generally only known approximately at the beginning of a design;
moreover, in some cases, we only know the function that the component must
perform, not how that function will be implemented. Thus the range of
possible parameters is often very broad, and to look for an exact
optimization when the parameters are known only over a broad range may be an
elegant mathematical formulation but is not a practical engineering solution.
In fact, sometimes choosing the exact optimum can involve considerable risk
if the solution is very sensitive to small changes in parameters.





To illustrate, let us assume that there are two design parameters, x and y, 
and the resulting reliability is z. We can visualize the solution as a surface in 
x, y, z space, where the reliability is plotted along the vertical z-axis as the two 
design parameters vary in the horizontal xy plane. Thus our solution is a surface 
lying above the xy plane and the height (z) of the surface is our reliability that 
ranges between 0 and unity. Suppose our surface has two maxima: one where 
the surface is a tall, thin spire with the reliability zs = 0.98 at the peak, which 
occurs at xs, ys, and the other where the surface is a broad one and where the 
reliability reaches zb = 0.96 at a small peak located at xb, yb in the center of 
a broad plateau having a height of 0.94. Clearly, if we choose the spire as our
design and if parameters x or y are a little different from xs, ys, the
reliability may be much lower (below 0.96 and even below 0.94) because of the
steep slopes on the flanks of the spire. Thus the maximum of 0.96 is probably a better
design and has less risk, since even if the parameters differ somewhat from xb, 
yb, we still have the broad plateau where the reliability is 0.94. Most of the 
exact optimization techniques would choose the spire and not even reveal the 
broad peak and plateau as other possibilities, especially if the points xs, ys and 
xb, yb were well-separated. Thus it is important to find a means of calculating 
the sensitivity of the solution to parameter variations or calculating a range of 
good solutions close to the optimum. 
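
The spire-versus-plateau argument can be made concrete with a toy calculation. Everything in the sketch below is hypothetical: the surface shape, the peak locations (2, 2) and (7, 7), the widths, and the function names are invented purely to illustrate the sensitivity idea. Only the heights 0.98, 0.96, and 0.94 come from the discussion above.

```python
import math

def reliability(x, y):
    """Toy surface: a narrow spire at (2, 2) and a broad plateau around (7, 7)."""
    spire = 0.98 * math.exp(-((x - 2) ** 2 + (y - 2) ** 2) / (2 * 0.05 ** 2))
    d = math.hypot(x - 7, y - 7)
    # Plateau of height 0.94 with a small peak of 0.96 at its center.
    plateau = 0.94 + 0.02 * math.exp(-d ** 2 / (2 * 0.3 ** 2)) if d <= 2 else 0.0
    return max(spire, plateau)

def worst_case(x0, y0, delta, steps=5):
    """Lowest reliability on a small grid of perturbations within +/- delta."""
    offsets = [delta * (2 * i / (steps - 1) - 1) for i in range(steps)]
    return min(reliability(x0 + dx, y0 + dy) for dx in offsets for dy in offsets)

for label, (x0, y0) in (("spire", (2, 2)), ("plateau", (7, 7))):
    nominal = reliability(x0, y0)
    robust = worst_case(x0, y0, delta=0.2)
    print(f"{label}: nominal {nominal:.2f}, worst case within +/-0.2: {robust:.4f}")
```

With these invented numbers the spire has the higher nominal reliability, but its worst case under a small parameter error collapses far below 0.94, whereas the plateau design never falls below 0.94. Ranking designs by such a worst-case (or average) value over the expected parameter range is one simple way to expose this kind of sensitivity.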

There has been much emphasis in the theoretical literature on how to find
an exact optimization. The brute force approach is to enumerate all possible
combinations and calculate the resulting reliability; however, except for
small problems, this approach requires long or intractable computations. An
alternate approach uses dynamic programming to reduce the number of possible
combinations that must be evaluated by breaking the main optimization into
a sequence of carefully formulated suboptimizations [Bierman, 1969; Hiller,
1974; Messinger, 1970]. The approach that this chapter recommends is the
use of a two-step procedure. We assume that the problem in question is a
large system. Generally, at the top level of a large system, the problem can
be modeled as a series connection of a number of subsystems. The process of
apportionment (see Lloyd [1977, Appendix 9A]) is used to allocate the system
reliability (or availability) goal among the various subsystems and is the
first step of the procedure. This process should reduce a large problem into a
number of smaller subproblems, the optimization of which we can approach by
using a bounded enumeration procedure. One can greatly reduce the size of the
solution space by establishing a sequence of bounds; the resulting subsystem
optimization is well within the power of a modern PC, and solution times are
reasonable. Of course, the first step in the process, that of apportionment,
is generally a good one, but it is not necessarily an optimum one. It does,
however, fit in well with the philosophy alluded to in the previous section
that a broad, easy-to-achieve, easy-to-understand suboptimum is preferred in
a practical case. As described later in this chapter, allocation tends to
divert more resources to the “weakest link in the chain.”

There are other important practical arguments for simplified semioptimum
techniques instead of exact mathematical optimization. In practice,
optimizing a design is a difficult problem for many reasons. Designers, often
harried by schedule and costs, look for a feasible solution to meet the
performance parameters; thus reliability may be treated as an afterthought.
This approach seldom leads to a design with optimum reliability, much less a
good suboptimal design. The opposite extreme is the classic optimization
approach, in which a mathematical model of the system is formulated along
with constraints on cost, volume, weight, and so forth, where all the
allowable combinations of redundant parallel and standby components are
permitted and where the underlying integer programming problem is solved. The
latter approach is seldom taken for the previously stated reasons: (a) the
system designer does not understand the mathematical formulation or the
solution process; (b) the solution takes too long; (c) the parameters are not
well known; and (d) the constraints rapidly change and invalidate the
previous solution. Therefore, clear, simple, and rapid calculation of a
family of good suboptimal solutions is a sensible approach. The study of this
family should reveal which solutions, if any, are very sensitive to changes
in the model parameters. Furthermore, the computations are simple enough that
they can be repeated should significant changes occur during the design
process. Establishing such a range of solutions is an ideal way to ensure
that reliability receives adequate consideration among the various
conflicting constraints and system objectives during the trade-off process,
which is the preferred approach to choosing a good, well-balanced design.


7.3 A MATHEMATICAL STATEMENT OF THE OPTIMIZATION 
PROBLEM 

One can easily define the classic optimization approach as a mathematical
model of the system that is formulated along with constraints on cost, volume,
weight, and so forth, in which all the allowable combinations of redundant
parallel and standby components are permitted and the underlying integer
programming problem must be solved.

We begin with a series model for the system with $k$ components, where $x_1$
is the event success of element one, $\bar{x}_1$ is the event failure of
element one, and $P(x_1) = 1 - P(\bar{x}_1)$ is the probability of success of
element one, which is the reliability, $r_1$ (see Fig. 7.1). Clearly, the
components in the foregoing mathematical model can be subsystems if we wish.

The system reliability is given by the probability of the event in which all 
the components succeed (the intersection of their successes): 


$$R_s = P(x_1 \cap x_2 \cap \cdots \cap x_k) \tag{7.1a}$$

If we assume that all the elements are independent, Eq. (7.1a) becomes

$$R_s = \prod_{i=1}^{k} r_i \tag{7.1b}$$

Figure 7.1 A series system of k components.

We will let the single constraint on our design be the cost for illustrative
purposes, and the total cost, $c$, is given by the sum of the individual
component costs, $c_i$:

$$c = \sum_{i=1}^{k} c_i \tag{7.2}$$

We assume that the system reliability given by Eq. (7.1b) is below the system
specifications or goals, $R_g$, and that the designer must improve the
reliability of the system to meet these specifications. (In the highly unusual
case where the initial design exceeds the reliability specifications, the
initial design can be used with a built-in safety factor, or else the designer
can consider using cheaper shorter-lifetime parts to save money; the latter is
sometimes a risky procedure.) We further assume that the maximum allowable
system cost, $c_0$, is in general sufficiently greater than $c$ so that the
funds can be expended (e.g., redundant components added) to meet the
reliability goal. If the goal cannot be reached, the best solution is the one
with the highest reliability within the allowable cost constraint.

In the case where more than one solution exceeds the reliability goal within 
the cost constraint, it is useful to display a number of “good” solutions. Since 
we wish the mathematical optimization to serve a practical engineering design 
process, we should be aware that the designer may choose to just meet the 
reliability goal with one of the suboptimal solutions and save some money. 
Alternatively, there may be secondary factors that favor a good suboptimal
solution (e.g., the sensitivity and risk factors discussed in the preceding
section).

There are three conventional approaches to improving the reliability of the 
system posed in the preceding paragraph: 

1. Improve the reliability of the basic elements, $r_i$, by allocating some or
all of the cost budget, $c_0$, to fund redesign for higher reliability.

2. Place components in parallel with the subsystems that operate continuously
(see Fig. 7.2). This is ordinary parallel redundancy (hot redundancy).

3. Place components in parallel (standby) with the k subsystems and switch
them in when an on-line failure is detected (cold redundancy).

Figure 7.2 The choice of redundant components to optimize the reliability of the
series system of Fig. 7.1.

There are also strategies that combine these three approaches. Such combined
approaches, as well as reliability improvement by redesign, are discussed
later in this chapter and also in the problems. Most of the chapter focuses on the
second and third approaches of the preceding list: hot and cold redundancy.


7.4 PARALLEL AND STANDBY REDUNDANCY 

7.4.1 Parallel Redundancy 

Assuming that we employ parallel redundancy (ordinary redundancy, hot
redundancy) to optimize the system reliability, $R_s$, we employ $n_i$ elements
in parallel to raise the reliability of each subsystem, which we denote by
$R_i$ (see Fig. 7.2).

The reliability of a parallel system of $n_i$ independent components is most
easily formulated in terms of the probability of failure, $(1 - r_i)^{n_i}$.
For the structure of Fig. 7.2, where all failures are independent, Eq. (7.1b)
becomes

$$R_s = \prod_{i=1}^{k} \left(1 - [1 - r_i]^{n_i}\right) \tag{7.3}$$

and Eq. (7.2) becomes

$$c = \sum_{i=1}^{k} n_i c_i \tag{7.4}$$

We can develop a similar formulation for standby redundancy. 
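
Equations (7.3) and (7.4) are easy to evaluate for a candidate design. The
following is a minimal sketch, not a definitive implementation; the function
names and the three-subsystem data are hypothetical and chosen only for
illustration.

```python
from math import prod

def parallel_system_reliability(r, n):
    """Eq. (7.3): series system whose subsystem i has n[i] parallel units of reliability r[i]."""
    return prod(1.0 - (1.0 - ri) ** ni for ri, ni in zip(r, n))

def system_cost(c, n):
    """Eq. (7.4): total cost when subsystem i uses n[i] units costing c[i] each."""
    return sum(ni * ci for ni, ci in zip(n, c))

# Hypothetical data (assumed for illustration, not taken from the text).
r = [0.90, 0.95, 0.80]   # unit reliabilities r_i
c = [1.0, 2.0, 1.5]      # unit costs c_i
n = [2, 1, 3]            # number of parallel units n_i

R_s = parallel_system_reliability(r, n)
C = system_cost(c, n)
print(f"R_s = {R_s:.4f}, cost = {C:.1f}")
```

Repeating the calculation for several choices of n is one simple way to see how
added redundancy trades reliability against cost.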

7.4.2 Standby Redundancy 

In the case of standby systems, it is well known that the probability of failure
is governed by the Poisson distribution (see Section A5.4).

$$P(x; \mu) = \frac{\mu^x e^{-\mu}}{x!} \tag{7.5}$$

where

$x$ = the number of failures

$\mu$ = the expected number of failures

A standby subsystem succeeds if there are fewer failures than the number
of available components, $x_i < n_i$; thus, for a system that is to be
improved by standby redundancy, Eq. (7.3) becomes

$$R_s = \prod_{i=1}^{k} \sum_{x_i=0}^{n_i-1} P(x_i; \mu_i) \tag{7.6}$$


and, of course, the system cost is still computed from Eq. (7.4). 
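
The standby formulation can be evaluated in the same way as the parallel one.
The sketch below assumes that the expected number of failures for each
subsystem, mu[i], is supplied by the caller (for a constant failure rate this
might be the product of the rate and the mission time); the function names and
numbers are hypothetical.

```python
from math import exp, factorial, prod

def poisson(x, mu):
    """Eq. (7.5): probability of exactly x failures when mu failures are expected."""
    return mu ** x * exp(-mu) / factorial(x)

def standby_subsystem(n, mu):
    """Inner sum of Eq. (7.6): the subsystem survives if fewer than n failures occur."""
    return sum(poisson(x, mu) for x in range(n))

def standby_system(n, mu):
    """Eq. (7.6): product of the subsystem reliabilities for the series structure."""
    return prod(standby_subsystem(ni, mui) for ni, mui in zip(n, mu))

# Hypothetical data (assumed for illustration only).
mu = [0.10, 0.05, 0.20]  # expected failures per subsystem
n = [2, 1, 3]            # units available per subsystem (one on-line plus spares)

R_s = standby_system(n, mu)
print(f"R_s = {R_s:.4f}")
```

With a single unit (n = 1) the inner sum reduces to exp(-mu), the reliability
of one element, which is a convenient sanity check.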


7.5 HIERARCHICAL DECOMPOSITION 

This section examines the way a designer deals with a complex problem and 
attempts to extract the engineering principles that should be employed. This 
leads to a number of viewpoints, from which some simple approaches emerge. 
The objective is to develop an approach that allows the designer to decompose 
a complex system into a manageable architecture. 

7.5.1 Decomposition 

Systems engineering generally deals with large, complex structures that, when 
taken as a whole (in the gestalt), are often beyond the “intellectual span of 
control.” Thus the first principle in approaching such a design is to decompose 
the problem into a hierarchy of subproblems. This initial decomposition stops 
when the complexity of the resulting components is reduced to a level that puts 
it within the “intellectual span of control” of one manager or senior designer. 
This approach is generally called divide and conquer and is presented for use 
on complex problems in books on algorithms [Aho, 1974, p. 60; Cormen, 1992, 
p. 12]. The term probably comes from the ancient political maxim divide et 
impera (“divide and rule”) cited by Machiavelli [Bartlett, 1968, p. 150b], or 
possibly early principles of military strategy. 

7.5.2 Graph Model 

Although the decomposition of a large system is generally guided by experience
and intuition, some guidelines can help direct the process. We begin by
examining the structure of the decomposition. One can describe a hierarchical
block diagram of a system in more precise terms if we view it as a
mathematical graph [Cormen, 1992, pp. 93-94]. We replace each box in the block
diagram by a vertex (node) and leave the connecting lines, which form the
edges (branches) of the graph. Since information can flow in both directions,
this is an undirected graph; if information can flow in only one direction,
however, the graph is a directed graph, and an arrowhead is drawn on the edge
to indicate the direction. A path in the graph is a continuous sequence of
vertices from the start vertex to the end vertex. If the end vertex is the same
as the start vertex, then this (closed) path is called a cycle (loop). A graph
without cycles where all the nodes are connected is called a tree (the graph
corresponding to a hierarchical block diagram is a tree). The top vertex of a
tree is called the root (root node). In general, a node in the tree that
corresponds to a component with subcomponents is called a parent of the
subcomponents, which are called children. The root node is considered to be at
depth 0 (level 0); its children are at depth 1 (level 1). In general, if a
parent node is at level n, then its children are at level n + 1. The largest
depth of any vertex is called the depth of the tree. The number of children
that a parent has is the out-degree, and the number of parents connected to a
child is the in-degree. A node that has no children is the end node (terminal
node) of a path from the root node and is called a leaf node (external node).
Nonleaf nodes are called internal nodes. An example illustrating some of this
nomenclature is given in Fig. 7.3.

Figure 7.3 A tree model of a hierarchical decomposition illustrating some graph
nomenclature. (Figure labels: Depth 3; Height 3.)
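
The nomenclature can be made concrete with a small data structure. The sketch
below is illustrative only: the hierarchy, its node names, and the helper
function are hypothetical, and a dictionary of child lists is just one
convenient way to represent a tree.

```python
# Hypothetical hierarchy: each key is a node, each value lists its children.
tree = {
    "system": ["A", "B"],
    "A": ["A1", "A2"],
    "B": ["B1"],
    "A1": [], "A2": [], "B1": [],
}

def depths(tree, root):
    """Return {node: depth}, with the root at depth 0 and children one level deeper."""
    result = {root: 0}
    stack = [root]
    while stack:
        parent = stack.pop()
        for child in tree[parent]:
            result[child] = result[parent] + 1
            stack.append(child)
    return result

node_depth = depths(tree, "system")
tree_depth = max(node_depth.values())                  # largest depth of any vertex
leaves = [v for v, kids in tree.items() if not kids]   # nodes with no children
out_degree = {v: len(kids) for v, kids in tree.items()}

print(tree_depth, sorted(leaves), out_degree["system"])
```

Because every node except the root appears in exactly one child list, the
in-degree of each nonroot node is one, as a tree requires.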

7.5.3 Decomposition and Span of Control 

If we wish our decomposition to be modeled by a tree, then the in-degree must 
always be one to prevent cycles or inputs to a stage entering from more than
one stage. Sometimes, however, it is necessary to have more than one input
to a node, in which case one must worry about synchronization and coupling 
between the various nodes. Thus, if node x has inputs from nodes p and q, 
then any change in either p or q will affect node x. Imposing this restriction 
on our hierarchical decomposition leads to simplicity in the interfacing of the 
various system elements. 

We now discuss the appropriate size of the out-degree. If we wish to decompose
the system, then the minimum size of the out-degree at each node must
be two, although this will result in a tree of great height. Of course, if any
node has a great number of children (a large out-degree), we begin to strain
the intellectual span of control. The experimental psychologist Miller [1956]
studied a large number of experiments related to sensory perception and
concluded that humans can process about 5-9 levels of “complexity.” (A
discussion of how Miller’s numbers relate to the number of mental
discriminations that one can make appears in Shooman [1983, pp. 194, 195].) If
we specify the out-degree to be seven for each node and all the leaves
(terminal nodes) to be at level (depth) $h$, then the number of leaves at
level $h$ ($NL_h$) is given by

$$NL_h = 7^h \tag{7.7}$$

In practice, each leaf is the lowest level of replaceable unit, which is
generally called a line replaceable unit (LRU). In the case of software, we
would probably call the analog of an LRU a module or an object. The total
number of nodes, $N$, in the graph can be computed if we assume that all the
leaves appear at level $h$.

$$N = NL_0 + NL_1 + NL_2 + \cdots + NL_h \tag{7.8a}$$

If each parent node has seven children, Eq. (7.8a) becomes

$$N = 1 + 7 + 7^2 + \cdots + 7^h \tag{7.8b}$$

Using the formula for the sum of the terms in a geometric progression,

$$N = a(r^n - 1)/(r - 1) \tag{7.9a}$$

where

$r$ = the common ratio (in our case, 7)
$n$ = the number of terms (in our case, $h + 1$)
$a$ = the first term (in our case, 1)

Substitution in Eq. (7.9a) yields

$$N = (7^{h+1} - 1)/6 \tag{7.9b}$$

If $h = 2$, we have $N = (7^3 - 1)/6 = 57$. We can check this by substitution in
Eq. (7.8b), yielding $1 + 7 + 49 = 57$.
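
The counts in Eqs. (7.7)-(7.9) can be tabulated for several depths with a few
lines of code. This is a small sketch under the same idealization as the text
(every parent has the same out-degree and all leaves lie at level h); the
function names are hypothetical.

```python
def nodes_at_level(level, out_degree=7):
    """Eq. (7.7): number of nodes at a given level of a full tree."""
    return out_degree ** level

def total_nodes(h, out_degree=7):
    """Eq. (7.9b): closed-form node count for levels 0..h (requires out_degree >= 2)."""
    return (out_degree ** (h + 1) - 1) // (out_degree - 1)

for h in range(4):
    # Cross-check the closed form against the term-by-term sum of Eq. (7.8b).
    assert total_nodes(h) == sum(nodes_at_level(i) for i in range(h + 1))
    print(f"h = {h}: leaves = {nodes_at_level(h)}, nodes = {total_nodes(h)}")
```

Changing out_degree to 5 or 9 shows how quickly the number of lowest-level
units grows across Miller's suggested range.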

7.5.4 Interface and Computation Structures

Another way of viewing a decomposition structure is to think in terms of two
classes of structures: interfaces and computational elements, a breakdown
that applies to either hardware or software. In the case of hardware, the
computational elements are LRUs; for software, they are modules or classes. In
the case of hardware, the interfaces are analog or digital signals (electrical,
light, sound) passed from one element (depth, level) to another; the joining of
mechanical surfaces, hydraulic or pneumatic fluids; or similar physical
phenomena. In the case of software, the interfaces are generally messages,
variables, or parameters passed between procedures or objects. Both hardware
and software have errors (failure rates, reliability) associated with either
the computational elements or the interfaces. If we again assume that leaves
appear only at the lowest level of the tree, the number of computational
elements is given by the last term in Eq. (7.8a), $NL_h$. In counting
interfaces, there is the interface out of an element at level $i$ and the
interface to the corresponding element at level $i + 1$. In electrical terms,
we might call this the output impedance and the corresponding input impedance.
In the case of software, we would probably be talking about the passing of
parameters and their scope between a procedure call and the procedure that is
called, or else the passing of messages between classes and objects. For both
hardware and software, we count the interface (information-out-information-in)
pair as a single interface. Thus all modules except level 0 have a single
associated interface pair. There is no structural interface at level 0;
however, let us consider the system specifications as a single interface at
level 0. Thus, we can use Eqs. (7.8) and (7.9) to count the number of
interfaces, which is equal to the total number of nodes, $N$. Continuing the
foregoing example where $h = 2$, we have $7^2 = 49$ computational elements and
$(7^3 - 1)/6 = 57$ interfaces. Of course, in a practical example, not all the
leaves will appear at depth (level) $h$, since some of the paths will terminate
before level $h$; thus the preceding computations and formulas can only be
considered upper bounds on an actual (less-idealized) problem.

One can use these formulas for the number of interfaces and computational
units to conjecture models for complexity, errors, reliability, and cost.


7.5.5 System and Subsystem Reliabilities 

The structure of the system at level 1 in the graph model of the hierarchical 
decomposition is a group of subsystems equal in number to the out-degree of 
the root node. Based on Miller’s work, we have decided to let the out-degree 
be 7 (or 5 to 9). As an example, let us consider an overview of an air traffic 
control (ATC) system for an airport [Gilbert, 1973, p. 39, Fig. 61]. Level 0 in 
our decomposition is the “air traffic control system.” At level 1, we have the 
major subsystems that are given in Table 7.1. 

TABLE 7.1 A Typical Air Traffic Control System at Level 1

• Tracking radar and associated computer.

• Air traffic control (ATC) displays and display computers.

• Voice communications with pilot.

• Transponders on the aircraft (devices that broadcast a digital
identification code and position information).

• Communications from other ATC centers.

• Weather information.

• The main computer.


An expert designer of a new ATC system might view things a little differently
(in fact, two expert designers working for different companies might
each come up with a slightly different model even at level 1), but the list in
Table 7.1 is sufficient for our discussions. We let $X_1$ represent the success
of the tracking radar, $X_2$ represent the success of the controller displays,
and so on up to $X_7$, which represents the success of the main computer. We
can now express the reliability of the system in terms of events $X_1$-$X_7$.
At this high a level, the system will only succeed if all the subsystems
succeed; thus the system reliability, $R_s$, can be expressed as

$$R_s = P(X_1 \cap X_2 \cap \cdots \cap X_7) \tag{7.10}$$

If the seven aforementioned subsystems are statistically independent, then
Eq. (7.10) becomes

$$R_s = P(X_1)P(X_2) \cdots P(X_7) \tag{7.11}$$

In all likelihood, the independence assumption at this high level is valid; it
is unlikely that one could postulate mechanisms whereby the failure of the
tracking radar would cause failure of the controller displays. The common mode
failure mechanisms that would lead to dependence (such as a common power 
system or a hurricane) are quite unlikely. System designers would be aware 
that a common power system is a vulnerable point and therefore would not 
design the system with this feature. In all likelihood, the systems will have 
independent computer systems. Similarly, it is unlikely that a hurricane would 
damage both the tracking radar and the controller displays; the radar should 
be designed for storm resistance, and the controller displays should be housed 
in a stormproof building; moreover, the occurrence of a hurricane should be 
much less frequent than that of other possible forms of failure modes. Thus 
it is a reasonable engineering assumption that statistical independence exists, 
and Eq. (7.11) is a valid simplification of Eq. (7.10). 

Because of the nature of the probabilities, that is, they are bounded by 0 and
1, and also because of the product nature of Eq. (7.11), we can bound each
of the terms. There is an infinite number of values of $P(X_1), P(X_2), \ldots,
P(X_7)$ that satisfy Eq. (7.11); however, the smallest value of $P(X_1)$ occurs
when $P(X_2), \ldots, P(X_7)$ assume their largest values, that is, unity. We
can repeat this solution for each of the subsystems to yield a set of minimum
values.

$$P(X_1) \ge R_s \tag{7.12a}$$

$$P(X_2) \ge R_s \tag{7.12b}$$

and so on up to

$$P(X_7) \ge R_s \tag{7.12c}$$

These minimum bounds are true in general for any subsystem if the system
structure is series; thus we can write

$$P(X_i) \ge R_s \tag{7.13}$$

The equality only holds in Eqs. (7.12) and (7.13) if all the other subsystems 
have a reliability equal to unity (i.e., they never fail); thus, in the real world, the 
equality conditions can be deleted. These minimum bounds will play a large 
role in the optimization technique developed later in this chapter. 


7.6 APPORTIONMENT 

As was discussed in the previous section, one of the first tasks in approaching 
the design of a large, complex system is to decompose it. Another early task 
is to establish reliability allocations or budgets for the various subsystems that 
emerge during the decomposition, a process often referred to as apportionment 
or allocation. At this point, we must discuss the difference between a
mathematician’s and an engineer’s approach to optimization. The mathematician
would ask for a precise system model down to the LRU level, the failure rate,
and cost, weight, volume, etc., of each LRU; then, the mathematician invokes
an optimization procedure to achieve the exact optimization. The engineer, on
the other hand, knows that this is too complex to calculate and understand in
most cases and therefore seeks an alternate approach. Furthermore, the
engineer knows that many of the details of lower-level LRUs will not be known
until much later and that estimates of their failure rates at that point would be
rather vague, so he or she adopts a much simpler design approach: beginning a
top-down process to apportion the reliability goal among the major subsystems
at depth 1.

Apportionment has historically been recognized as an important reliability 
system goal [AGREE Report, 1957, pp. 52-57; Henney, 1956, Chapter 1; Von 
Alven, 1964, Chapter 6]; many of the methods discussed in this section are 
an outgrowth of this early work. We continue to assume that there are about
7 subsystems at depth 1. Our problem is how to allocate the reliability goal
among the subsystems. Several procedures exist on which to base such an
allocation early in the design process; these are listed in Table 7.2.

TABLE 7.2 Approaches to Apportionment

Approach                 Basis                            Comments
-----------------------  -------------------------------  ------------------------------
Equal weighting          All subsystems should have       Easy first attempt.
                         the same reliability.

Relative difficulty      Some knowledge of relative       Heuristic method requiring
                         cost or difficulty to            only approximate ordering
                         improve subsystem                of cost or difficulty.
                         reliability.

Relative failure rates   Requires some knowledge of       Easier to use than the
                         the relative subsystem           relative difficulty method.
                         failure rates.

Albert’s method          Requires an initial estimate     A well-defined algorithm is
                         of the subsystem                 used that is based on
                         reliabilities.                   assumptions about the
                                                          improvement-effort function.

Stratified optimization  Requires detailed model of       Discussed in Section 7.6.5.
                         the subsystem.


7.6.1 Equal Weighting 

The simplest approach to apportionment is to assume equal subsystem
reliability, $r$. In such a case, Eq. (7.11) becomes

$$R_s = P(X_1)P(X_2) \cdots P(X_7) = r^7 \tag{7.14a}$$

For the general case of $n$ independent subsystems in series,

$$R_s = r^n \tag{7.14b}$$

Solving Eqs. (7.14a) and (7.14b) for $r$ yields

$$r = (R_s)^{1/7} \tag{7.15a}$$

$$r = (R_s)^{1/n} \tag{7.15b}$$

This equal weighting apportionment is so simple that it is probably one of the
first computations made in a project. System engineers typically “whip out”
their calculators and record such a calculation on the back of an envelope or
a piece of scrap paper during early discussions of system design.

As an example, suppose that we have a system reliability goal of 0.95, in 
which case Eq. (7.15a) would yield an apportioned goal of r = 0.9927. Of 
course, it is unlikely that it would be equally easy or costly to achieve the 
apportioned goal of 0.9927 for each of the subsystems. Thus this method gives 
a ballpark estimate, but not a lot of time should be spent using it in the design 
before a better method replaces it. 
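
The back-of-the-envelope calculation of Eq. (7.15b) is a one-line function. The
sketch below reproduces the example with a goal of 0.95 and seven subsystems;
the function name is hypothetical.

```python
def equal_apportionment(R_goal, n):
    """Eq. (7.15b): common subsystem reliability for n independent series subsystems."""
    return R_goal ** (1.0 / n)

r = equal_apportionment(0.95, 7)
print(f"r = {r:.4f}")              # apportioned goal for each subsystem
print(f"check: r^7 = {r**7:.4f}")  # recovers the system goal
```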

7.6.2 Relative Difficulty

Suppose that we have some knowledge of the subsystems and can use it in
the apportionment process. Assume that we are at level 1, that we have seven
subsystems to deal with, and that we know that for three of the subsystems,
achieving a high level of reliability (e.g., the level required for equal
apportionment) will be difficult. We envision that these three systems could
meet their goals if they can be realized by two parallel elements. We then
would have reliability expressions similar to those of Eq. (7.14b) for the four
“easier” systems and a reliability expression $2r - r^2$ for the three “harder”
systems. The resultant expression is

$$R_s = r^4 (2r - r^2)^3 \tag{7.16}$$

Solving Eq. (7.16) numerically for a system goal of 0.95 yields $r = 0.9874$.
Thus the four “easier” subsystems would have a single system with a
reliability of 0.9874, and the three harder systems would have two parallel
systems with the same reliability. Another solution is to keep the goal of
$r = 0.9927$, calculated previously for the easier subsystems. Then, the three
harder systems together would have to meet the goal of
$0.95/0.9927^4 = 0.9783$. Each parallel element of the three harder systems
would then have to meet a lower goal: $(2r - r^2)^3 = 0.9783$, or $r = 0.915$.
Other similar solutions can easily be obtained.

The previous paragraph dealt with unequal apportionments by considering
a parallel system for the three harder systems. If we assume that parallel
systems are not possible at this level, we must choose a solution where the
easier systems exceed a reliability of 0.9927 so that the harder systems can
have a smaller reliability. For convenience, we could rewrite Eq. (7.11) in
terms of unreliabilities, $r_i = 1 - u_i$, obtaining

$$R_s = (1 - u_1)(1 - u_2) \cdots (1 - u_7) \tag{7.17a}$$

Again, suppose there are four easier systems with a failure probability of
$u_1 = u_2 = u_3 = u_4 = u$. The harder systems will have twice the failure
probability, $u_5 = u_6 = u_7 = 2u$, and Eq. (7.17a) becomes

$$R_s = (1 - u)^4 (1 - 2u)^3$$

that yields a 7th-order polynomial.

The easiest way to solve the polynomial is through trial and error with a
calculator or by writing a simple program loop to calculate a range of values.
The equal reliability solution was $r = 0.9927 = 1 - 0.0073$. If we try
$r_{easy} = 0.995 = 1 - 0.005$, $r_{hard} = 0.99 = 1 - 0.01$, and substitute in
Eq. (7.17a), the result is

$$0.951038 = (0.995)^4 (0.99)^3 \tag{7.17b}$$

Trying some slightly larger values of $u$ results in a solution of

$$0.950079 = (0.9949)^4 (0.9898)^3 \tag{7.17c}$$

The accuracy of this method depends largely on how realistic the guesses are 
regarding the hard and easy systems. The method of the next section is similar, 
but the calculations are easier. 
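
The numerical solutions in this section can be reproduced with a short program.
The sketch below uses bisection, which is an assumption made here for
illustration (the text suggests only trial and error or a simple loop); the
helper name and the search intervals are likewise hypothetical.

```python
def bisect(f, lo, hi, tol=1e-10):
    """Find a root of f in [lo, hi], assuming f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)

GOAL = 0.95

# Eq. (7.16): four single subsystems and three two-unit parallel subsystems.
r_parallel = bisect(lambda r: r**4 * (2*r - r**2)**3 - GOAL, 0.9, 1.0)

# Unreliability form: four subsystems with unreliability u, three with 2u.
u = bisect(lambda u: (1 - u)**4 * (1 - 2*u)**3 - GOAL, 0.0, 0.05)

print(f"Eq. (7.16): r = {r_parallel:.4f}")
print(f"u = {u:.4f}, r_easy = {1 - u:.4f}, r_hard = {1 - 2*u:.4f}")
```

Any other weighting of the harder systems (for example, 3u instead of 2u) can
be explored by changing the second expression.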

7.6.3 Relative Failure Rates 

It is simpler to use knowledge about easier and harder systems during
apportionment if we work with failure rates. We assume that each subsystem has
a constant failure rate $\lambda_i$ and that the reliability for each subsystem
is given by

$$r_i = e^{-\lambda_i t} \tag{7.18a}$$

and substitution of Eq. (7.18a) into Eq. (7.11) yields

$$R_s = P(X_1)P(X_2) \cdots P(X_7) = e^{-\lambda_1 t} e^{-\lambda_2 t} \cdots e^{-\lambda_7 t} \tag{7.18b}$$

and Eq. (7.18b) can be written as

$$R_s = e^{-\lambda_s t} \tag{7.19}$$

where

$$\lambda_s = \lambda_1 + \lambda_2 + \cdots + \lambda_7$$

Continuing with our example of the previous section, in which the goal is
0.95, the four “easier” systems have a failure rate of $\lambda$, and the three
harder ones have a failure rate of $5\lambda$, Eq. (7.19) becomes

$$R_s = 0.95 = e^{-19\lambda t} \tag{7.20}$$

Solving for $\lambda t$, we obtain $\lambda t = 0.0026996$, and the
reliabilities are $e^{-0.0026996} = 0.9973$ and $e^{-5 \times 0.0026996} =
0.9865$. Thus our apportioned goals for the four easier systems are 0.9973; for
the three harder systems, 0.9865. As a check, we see that
$0.9973^4 \times 0.9865^3 = 0.9497$. Clearly, one can use this procedure
to achieve other allocations based on some relative knowledge of the nominal
failure rates of the various subsystems or on how difficult it is to achieve
various failure rates.
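
The same calculation generalizes to any set of relative failure rates. The
following sketch assumes the exponential reliability model of this section and
takes the relative rates as a list of weights; the function name and the weight
list are hypothetical.

```python
from math import exp, log, prod

def apportion_by_failure_rate(R_goal, weights):
    """Return (base, goals): base is lambda*t for weight 1; goals[i] = exp(-weights[i]*base)."""
    base = -log(R_goal) / sum(weights)   # from R_goal = exp(-sum(weights) * base)
    return base, [exp(-w * base) for w in weights]

# Four easier subsystems (weight 1) and three harder ones (weight 5), goal 0.95.
weights = [1, 1, 1, 1, 5, 5, 5]
base, goals = apportion_by_failure_rate(0.95, weights)

print(f"lambda*t = {base:.7f}")
print(f"easier goal = {goals[0]:.4f}, harder goal = {goals[4]:.4f}")
print(f"check: product = {prod(goals):.4f}")
```

Because the exponents add, the product of the apportioned goals equals the
system goal exactly, apart from rounding of the printed values.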


7.6.4 Albert’s Method 

A very interesting method that results in an algorithm rather than a design
procedure is known as Albert’s method and is based on some analytical
principles [Albert, 1958; Lloyd, 1977, pp. 267-271]. The procedure assumes that
initially there are some estimates of what reliabilities can be achieved for the
subsystems to be apportioned. In terms of our notation, we will say that
$P(X_1), P(X_2), \ldots, P(X_7)$ are given by some nominal values: $R_1, R_2,
\ldots, R_7$. Note that we continue to speak of seven subsystems at level 1;
however, this clearly can be applied to any number of subsystems. The fact that
we assume nominal values for the $R_i$ implies that we have a preliminary
design. However, in any large system there are many iterations in system
design, and this method is quite useful even if it is not the first one
attempted. Adopting the terminology of government contracting (which generally
has parallels in the commercial world), we might say that the methods of
Sections 7.6.1-7.6.3 are useful in formulating the request for proposal (RFP)
(the requirements) and that Albert’s method is useful during the proposal
preparation phase (specifications and proposed design) and during the early
design phases after the contract award. A properly prepared proposal will have
some early estimates of the subsystem reliabilities. Furthermore, we assume
that the system specification or goal is denoted by $R_g$, and the preliminary
estimates of the subsystem reliabilities yield a system reliability estimate
given by

$$R_s = R_1 R_2 \cdots R_7 \tag{7.21}$$

If the design team is lucky, $R_s > R_g$, and the first concerns about
reliability are thus satisfied. In fact, one might even think about trading off
some reliability for reduced cost. An experienced designer would tell us that
this almost never happens and that we are dealing with the situation where
$R_s < R_g$. This means that one or more of the $R_i$ values must be increased.
Albert’s method deals with finding which of the subsystem reliability goals
must be increased and by how much so that $R_s$ is increased to the point where
$R_s = R_g$.

Based on the bounds developed in Eq. (7.13), we can comment that any subsystem
reliability that is less than the system goal, $R_i < R_g$, must be increased
(others may also need to be increased). For convenience in developing our
algorithm, we assume that the subsystems have been renumbered so that the
reliabilities are in ascending order: $R_1 \le R_2 \le \cdots \le R_7$. Thus,
in the special case where $R_7 < R_g$, all the subsystem goals must be
increased. In this case, Albert’s method reduces to equal apportionment and
Eqs. (7.14) and (7.15) hold. In the more general case, $j$ of the seven
subsystems must have the reliability increased. Albert’s method requires that
all the $j$ subsystems have their reliability increased to the same value, $r$,
and that the reliabilities of the $(7 - j)$ subsystems remain unchanged. Thus,
Eq. (7.21) becomes

$$R_g = R_1 R_2 \cdots R_j R_{j+1} \cdots R_7 \tag{7.22}$$

where

$$R_1 = R_2 = \cdots = R_j = r \tag{7.23}$$

Substitution of Eq. (7.23) into Eq. (7.22) yields

$$R_g = (r^j)(R_{j+1} \cdots R_7) \tag{7.24}$$

We solve Eq. (7.24) for the value of $r$ (or, more generally, $r_j$):

$$r^j = R_g/(R_{j+1} \cdots R_7) \tag{7.25a}$$

$$r = (R_g/[R_{j+1} \cdots R_7])^{1/j} \tag{7.25b}$$

Equations (7.22)-(7.25) describe Albert’s method, but an important step
must still be discussed: how to determine the value of $j$. Again, we turn to
Eq. (7.13) to shed some light on this question. We can place a lower bound
on $j$ and say that all the subsystems having reliabilities smaller than or
equal to the goal, $R_i \le R_g$, must be increased. It is possible that if we
choose $j$ equal to this lower bound and substitute into Eq. (7.25b), the
computed value of $r$ will be >1, which is clearly impossible; thus we must
increase $j$ by 1 and try again. This process is repeated until the values of
$r$ obtained are <1. We now have a feasible value for $j$, but we may be
requiring too much “effort” to raise all the 1 through $j$ subsystems to the
resulting high value of $r$. It may be better to increment $j$ by 1 (or more),
reducing the value of $r$ and “spreading” this value over more subsystems.
Albert showed that based on certain effort assumptions, the optimum value of
$j$ is bounded from above when the value for $r$ first decreases to the point
where $r < R_j$. The optimum value of $j$ is the previous value of $j$, where
$r > R_j$. More succinctly, the optimum value for $j$ is the largest value for
$j$, where $r > R_j$. Clearly it is not too hard to formulate a
computer program for this algorithm; however, since we are assuming about
seven systems and have bounded $j$ from below and above, the most efficient
solution is probably done with paper, pencil, and a scientific calculator.

The reader may wonder why we have spent quite a bit of time explaining
Albert’s method rather than just stating it. The original exposition of the
method is somewhat terse, and the notation may be confusing to some; thus the 
enhanced development is warranted. The remainder of this section is devoted 
to an example and a discussion of when this method is “optimum.” The reader 
will note that some of the underlying philosophy behind the method can be 
summarized by the following principle: “The most efficient way to improve 
the reliability of a series structure (sometimes called a chain) is by improving 
the weakest links in the chain.” This principle will surface a few more times 
in later portions of this chapter. 

A simple example should clarify the procedure. Suppose that we have four
subsystems with initial reliabilities $R_1 = 0.7$, $R_2 = 0.8$, $R_3 = 0.9$, and $R_4 = 0.95$,
and the system reliability goal is $R_g = 0.8$. The existing estimates predict a
system reliability of $R_s = 0.7 \times 0.8 \times 0.9 \times 0.95 = 0.4788$. Clearly, some or all
of the subsystem goals must be raised for us to meet the system goal. Based on
Eq. (7.13), we know that we must improve subsystems 1 and 2, so we begin
our calculations at this point. The system reliability goal, $R_g = 0.8$, and Eq.
(7.25b) yields

$$r = \left( \frac{R_g}{R_{j+1} \cdots R_k} \right)^{1/j} = \left( \frac{0.8}{0.9 \times 0.95} \right)^{1/2} = (0.93567)^{0.5} = 0.96730 \tag{7.26}$$

Since 0.96730 > 0.9, we continue our calculation. We now recompute for
subsystems 1, 2, and 3, and Eq. (7.25b) yields

$$r = \left( \frac{0.8}{0.95} \right)^{1/3} = 0.9443 \tag{7.27}$$

Now, 0.9443 < 0.95, and we choose the previous value of j = 2 as our optimum.
As a check, we now compute the system reliability.

$$0.96730 \times 0.96730 \times 0.9 \times 0.95 = 0.7999972 \approx R_g$$

which equals our goal of 0.8 when rounded to one place. Thus, the conclusion
from the use of Albert's method is that the apportionment goals for the four
systems are $R_1 = R_2 = 0.96730$; $R_3 = 0.90$; and $R_4 = 0.95$. This solution assumes
equal effort for improving the reliability of all subsystems.
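
The arithmetic of Eqs. (7.26) and (7.27) can be checked with a few lines of Python. This is only a sketch: the function name `equal_goal` is hypothetical, and the formula is our reading of Eq. (7.25b) as the goal divided by the product of the untouched reliabilities, raised to the power 1/j. It evaluates r for a given j and leaves the choice of j to the discussion above.

```python
from math import prod

def equal_goal(reliabilities, goal, j):
    """Common reliability r assigned to the j weakest subsystems (our reading of Eq. 7.25b).

    `reliabilities` must be sorted in ascending order; the subsystems after
    the first j keep their original values.
    """
    untouched = prod(reliabilities[j:])      # product of the reliabilities left alone
    return (goal / untouched) ** (1.0 / j)

R = [0.7, 0.8, 0.9, 0.95]   # initial subsystem reliabilities from the example
Rg = 0.8                    # system reliability goal

r2 = equal_goal(R, Rg, 2)            # corresponds to Eq. (7.26)
r3 = equal_goal(R, Rg, 3)            # corresponds to Eq. (7.27)
check = r2 * r2 * R[2] * R[3]        # system reliability when subsystems 1 and 2 are raised to r2

print(f"j = 2: r = {r2:.5f}")
print(f"j = 3: r = {r3:.4f}")
print(f"system reliability check: {check:.7f}")
```

The check value differs slightly from the 0.7999972 quoted in the text because the text multiplies the rounded value 0.96730.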

The use of Albert’s method produces an optimum allocation policy if the 
following assumptions hold [Albert, 1958; Lloyd, 1977, pp. 267-271]: 

1. Each subsystem has the same effort function that governs the amount of
effort required to raise the reliability of the ith subsystem from $R_i$ to $r_i$.
This effort function is denoted by $G(R_i, r_i)$, and increased effort always
increases the reliability: $G(R_i, r_i) \ge 0$.

2. The effort function $G(x, y)$ is nondecreasing in y for fixed x; that is, given
an initial value of $R_i$, it will always require more effort to increase $r_i$ to
a higher value. For example,

$$G(0.35, 0.65) \le G(0.35, 0.75)$$

The effort function $G(x, y)$ is nonincreasing in x for fixed y; that is, given
an increase to $r_i$, it will always require less effort if we start from a larger
value of $R_i$. For example,

$$G(0.25, 0.65) \ge G(0.35, 0.65)$$

3. If $x < y < z$, then $G(x, y) + G(y, z) = G(x, z)$. This is a superposition
(linearity) assumption that states that if we increase the reliability in two
steps, the sum of the efforts for each step is the same as if we did the
increase in a single step.

4. $G(0, x)$ has a derivative $h(x)$ such that $x h(x)$ is strictly increasing in
$(0 < x < 1)$.
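
To make the four assumptions concrete, the sketch below spot-checks them numerically for one candidate effort function, G(x, y) = ln[(1 - x)/(1 - y)]. This function is purely an illustrative assumption and does not come from Albert or from the text above; the checks are numerical spot checks at a few points, not a proof.

```python
import math

def effort(x, y):
    """Hypothetical effort to raise a reliability from x to y, with 0 <= x <= y < 1."""
    return math.log((1.0 - x) / (1.0 - y))

# Assumption 1: effort is nonnegative when y >= x.
a1 = effort(0.35, 0.65) >= 0.0

# Assumption 2: nondecreasing in y for fixed x, nonincreasing in x for fixed y.
a2 = (effort(0.35, 0.65) <= effort(0.35, 0.75)
      and effort(0.25, 0.65) >= effort(0.35, 0.65))

# Assumption 3: two steps cost the same as a single step.
a3 = math.isclose(effort(0.2, 0.5) + effort(0.5, 0.9), effort(0.2, 0.9))

# Assumption 4: here G(0, x) = -ln(1 - x), so h(x) = 1/(1 - x) and
# x*h(x) = x/(1 - x), which should be strictly increasing on (0, 1).
xs = [i / 100 for i in range(1, 100)]
xh = [x / (1.0 - x) for x in xs]
a4 = all(u < v for u, v in zip(xh, xh[1:]))

print(a1, a2, a3, a4)
```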


The proof of the algorithm is given in Albert [1958]. If the assumptions
of Albert's method are not met, the equal effort rule is probably violated, for
which the methods of Sections 7.6.2 and 7.6.3 are suggested.

7.6.5 Stratified Optimization 

In a very large system, we might consider continuing the optimization to level 
2 by applying apportionment again to each of the subsystem goals. In fact, 
we can continue this process until we reach the LRU level and then utilize 
Eqs. (7.3) or (7.6) (or else improve the LRU reliability) to achieve our system 
design. Such decisions require some intuition and design experience on the 
part of the system engineers; however, the foregoing methods provide some 
engineering calculations to help guide intuition. 

7.6.6 Availability Apportionment 

Up until now, the discussion of apportionment has dealt entirely with system 
and subsystem reliabilities. Now, we discuss the question of how to proceed 
if system availabilities are to be apportioned. Under certain circumstances, the 
subsystem availabilities are essentially independent, and the system availability
is given by the same formula as for the reliabilities, with the availabilities
replacing the reliabilities. A discussion of availability modeling in general,
and a detailed discussion of the circumstances under which such substitutions
are valid, appears in Shooman [1990, Appendix F]. One situation in which
the availabilities are independent is where each subsystem has its own repair¬ 
man (or repaircrew). This is called repairman decoupling in Shooman [1990, 
Appendices F-4 and F-5]. In the decoupled case, one can use the same sys¬ 
tem structural model that is constructed for reliability analysis to compute sys¬ 
tem availability. The steady-state availability probabilities are substituted in the 
model just as the reliability probabilities would be. Clearly, this is a convenient 
situation and is often, but not always, approximately valid. 

Suppose, however, that the same repairman or repaircrew serves two or more
subsystems. In such a case, there is the possibility that a failure will occur in
subsystem y while the repairman is still busy working on a repair for subsystem
x, and a queue of repair requests develops. The queuing makes the subsystem
availabilities dependent, a situation called repairman coupling. When repairman
coupling is significant, one should formulate a
Markov model to solve for the resulting availability. Since Markov modeling
for a large subsystem can be complicated, as the reader can appreciate from
the analytical solutions of Chapter 3, a practical designer would be happy to
use a decoupled solution even if the results were only a good approximation.

Intuition tells us that the possibility of a queue forming is a function of the
ratio of repair rate to failure rate ($\mu/\lambda$). If the repair rate is much larger than the
failure rate, the approximation should be quite satisfactory. These approximations
were explored extensively in Section 4.9.3, and the reader should review
the results. We can explore the decoupled approximation again by considering
a slightly different problem than that in Chapter 4: two series subsystems that
are served by the same repairman. Returning to the results derived in Chapter
3, we can compute the exact availability using the model given in Fig. 3.16
and Eqs. (3.71a-c). This model holds for two identical elements (series, parallel,
and standby). If we want the model to hold for two series subsystems, we
must compute the probability that both elements are good, which is $P_{s_0}$. We
can compute the steady-state solution by letting $s \to 0$ in Eqs. (3.71a-c), as
was discussed in Chapter 3, and solving the resulting equations. The result is

$$A_\infty = P_{s_0} = \frac{\mu' \mu''}{\lambda \lambda' + \lambda' \mu'' + \mu' \mu''} \tag{7.28a}$$

This result is derived in Shooman [1990, pp. 344-346]. For ordinary (not
standby) two-element systems, $\lambda' = 2\lambda$ and $\mu' = \mu'' = \mu$. Substitution yields

$$A_\infty = \frac{\mu^2}{2\lambda^2 + 2\lambda\mu + \mu^2} \tag{7.28b}$$

The approximate result is given by the probability that both elements are up,
which is the product of the steady-state availabilities of the two elements,
each equal to $\mu/(\lambda + \mu)$:

$$A_\infty \approx \frac{\mu}{\mu + \lambda} \cdot \frac{\mu}{\mu + \lambda} \tag{7.29}$$

We can compare the two expressions for various values of ($\mu/\lambda$) in Table
7.3, where we have assumed that the values of $\mu$ and $\lambda$ for the two elements
are identical. From the last column in Table 7.3, we see that the ratio of
the approximate unavailability $(1 - A_{\approx})$ to the exact unavailability $(1 - A_{=})$
approaches unity and is quite acceptable in all the cases shown. Of course,
one might check the validity of the approximation for more complex cases;
however, the results are quite encouraging, and we anticipate that the
approximation will be applicable in many cases.
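
The comparison summarized in Table 7.3 can be reproduced with a short Python sketch. The function names are hypothetical, and the formulas assume the reading of Eq. (7.28b) as mu^2/(2 lambda^2 + 2 lambda mu + mu^2) and of Eq. (7.29) as the square of mu/(mu + lambda); only the ratio mu/lambda matters, so lambda is set to 1.

```python
def exact_availability(lam, mu):
    """Two identical series elements sharing one repairman (our reading of Eq. 7.28b)."""
    return mu**2 / (2 * lam**2 + 2 * lam * mu + mu**2)

def approx_availability(lam, mu):
    """Decoupled approximation: product of two single-element availabilities (Eq. 7.29)."""
    return (mu / (mu + lam)) ** 2

rows = []
for ratio in (1, 10, 100):
    lam, mu = 1.0, float(ratio)          # only mu/lam matters, so fix lam = 1
    a_approx = approx_availability(lam, mu)
    a_exact = exact_availability(lam, mu)
    unavail_ratio = (1 - a_approx) / (1 - a_exact)
    rows.append((ratio, a_approx, a_exact, unavail_ratio))
    print(f"{ratio:>5} {a_approx:.7f} {a_exact:.9f} {unavail_ratio:.3f}")
```

As the ratio of repair rate to failure rate grows, the unavailability ratio in the last column moves toward 1, which is the behavior the text relies on when it recommends the decoupled model.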


TABLE 7.3 Comparison of Exact and Approximate Availability Formulas

              Approximate Formula:   Exact Formula:       Ratio of Unavailability:
Ratio μ/λ     Eq. (7.29), A≈         Eq. (7.28b), A=      (1 − A≈)/(1 − A=)

1             0.25                   0.20                 0.94
10            0.826446               0.819672             0.96
100           0.9802962              0.980199961          0.995


7.6.7 Nonconstant-Failure Rates

In many cases, the apportionment approaches discussed previously depend on 
constant-failure rates (see especially Table 7.2, third row). If the failure rates 
vary with time, it is possible that the optimization results will hold only over 
a certain time range and therefore must be recomputed for other ranges. The 
analyst should consider this approach if nonconstant-failure rates are signifi¬ 
cant. In most cases, detailed information on nonconstant-failure rates will not 
be available until late in design, and approximate methods using upper and 
lower bounds on failure rates or computations for different ranges assuming 
linear variation will be adequate. 


7.7 OPTIMIZATION AT THE SUBSYSTEM LEVEL VIA 
ENUMERATION 

7.7.1 Introduction 

In the previous section, we introduced apportionment as an approximate opti¬ 
mization procedure at the system level. Now, we assume that we are at the 
subsystem level. At this point, we assume that each subsystem is at a level 
where we can speak of subsystem redundancy and where we can now con¬ 
sider exact optimization. (It is possible that in some smaller problems, the use 
of apportionment at the system level as a precursor is not necessary and we can 
begin exact optimization at this level. Also, it is possible that we are dealing 
with a system that is so complex that we have to apportion the subsystems into 
sub-subsystems—or even lower—before we can speak of redundant elements.) 
In all cases, we view apportionment as an approximate optimization process, 
which may or may not come first. 

The subject of system optimization has been extensively discussed in the reli¬ 
ability literature [Barlow, 1965, 1975; Bellman, 1958; Messinger, 1970; Myers, 
1964; Tillman, 1980] and also in more general terms [Aho, 1974; Bellman, 1957; 
Bierman, 1969; Cormen, 1992; Hiller, 1974]. The approach used was gener¬ 
ally dynamic programming or greedy methods; these approaches will be briefly 
reviewed later in this chapter. This section will discuss a bounded enumeration 
approach [Shooman and Marshall, 1993] that the author proposes as the simplest 
and most practical method for redundancy optimization. We begin our develop¬ 
ment by defining the brute force approach of exhaustive enumeration. 

7.7.2 Exhaustive Enumeration 

This approach is straightforward, but it represents a brute force approach to
the problem. Suppose we have subsystem i that has five elements and we wish
to improve the subsystem reliability to meet the apportioned subsystem goal
$R_g$. If practical considerations of cost, weight, or volume limit us to choosing
at most a single parallel subsystem for each of the five elements, each of the
five subsystems has zero or one element in parallel, and the total number of
possibilities is $2^5 = 32$. Given the computational power of a modern
personal computer, one could certainly write a computer program and evaluate
all 32 possibilities in a short period of time. The designer would then choose
the combination with the highest reliability or some other combination of good
properties and use the complete set of possibilities as the basis of design. As
previously stated, sometimes a close suboptimum solution is preferred because
of risk, uncertainty, sensitivity, or other factors. Suppose we could consider at
most two parallel subsystems for each of the five elements, in which case the
total number of possibilities is $3^5 = 243$. This begins to approach an unwieldy
number for computation and interpretation.

The actual number of computations involved in exhaustive enumeration is 
much larger if we do not impose a restriction such as “considering at most two 
parallel subsystems for each of the five elements.” To illustrate, we consider 
the following two examples [Shooman, 1994]: 

Example 1: The initial design of a system yields 3 subsystems at the first level
of decomposition. The system reliability goal, $R_g$, is 0.9 for a given number
of operating hours. The initial estimates of the subsystem reliabilities are $R_1$ =
0.85, $R_2$ = 0.5, and $R_3$ = 0.3. Parallel redundancy is to be used to improve the
initial design so that it meets the reliability goal. The constraint is cost; each
subsystem is assumed to cost 1 unit, and the total cost budget is 16 units.

The existing estimates predict an initial system reliability of $R_{s0}$ = 0.85 × 0.5
× 0.3 = 0.1275. Clearly, some or all of the subsystem reliabilities must be raised
for us to meet the system goal. Lacking further analysis, we can state that the
initial system costs 3 units and that 13 units are left for redundancy. Thus we
can allocate 0 or 1 or 2 or any number up to 13 parallel units to subsystem 1, a
similar number to subsystem 2, and a similar number to subsystem 3. An upper
bound on the number of states that must be considered would therefore be $14^3$
= 2,744. Not all of these states are possible because some of them violate the
cost constraint; for example, the combination of 13 parallel units for each
of the 3 subsystems costs 39 units, which is clearly in excess of the 13-unit
budget. However, even the actual number will be too cumbersome if not too
costly in computer time to deal with. In the next section, we will show that by
using the bounded enumeration technique, only 10 cases must be considered!

Example 2: The initial design of a system yields 5 subsystems at the first level
of decomposition. The system reliability goal, $R_g$, is 0.95 for a given number
of operating hours. The initial estimates of the subsystem reliabilities are $R_1$
= 0.8, $R_2$ = 0.8, $R_3$ = 0.8, $R_4$ = 0.9, and $R_5$ = 0.9. Parallel redundancy is to
be used to improve the initial design so that it meets the reliability goal. The
constraint is cost; the subsystems are assumed to cost 2, 2, 2, 3, and 3 units,
respectively, and the total cost budget is 36 units.

The existing estimates predict an initial system reliability of $R_{s0}$ = $0.8^3 \times 0.9^2$
= 0.41472. Clearly, some or all of the subsystem reliabilities must be raised
for us to meet the system goal. Lacking further analysis, we can state that the
initial system costs 12 units; thus we can allocate up to 24 cost units to each of
the subsystems. For subsystems 1, 2, and 3, we can allocate 0 or 1 or 2 or any
number up to 12 parallel units. For subsystems 4 and 5, we can allocate 0 or 1
or 2 or any number up to 8 parallel units. Thus an upper bound on the number
of states that must be considered would be $13^3 \times 9^2$ = 177,957. Not all of these
states are possible because some of them violate the cost constraint. In the next
section, we will show that by using the bounded enumeration technique, only
31 cases must be considered!
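
The state counts quoted in the two examples can be checked by brute force. The sketch below uses a hypothetical helper, `count_states`, that returns both the simple upper bound (the product of the per-subsystem ranges) and the number of allocations that actually fit within the spare budget. It is meant only to illustrate how quickly exhaustive enumeration grows, not as a recommended method.

```python
from itertools import product

def count_states(costs, spare_budget):
    """Return (upper_bound, feasible) counts of redundancy allocations.

    costs[i] is the cost of one extra parallel unit in subsystem i, and
    spare_budget is what remains after paying for the initial design.
    """
    max_units = [spare_budget // c for c in costs]   # most units any one subsystem could get
    upper = 1
    for m in max_units:
        upper *= m + 1                               # 0..m extra units for this subsystem
    feasible = sum(
        1
        for extra in product(*(range(m + 1) for m in max_units))
        if sum(e * c for e, c in zip(extra, costs)) <= spare_budget
    )
    return upper, feasible

ex1 = count_states([1, 1, 1], 16 - 3)          # Example 1: unit costs, 13 units spare
ex2 = count_states([2, 2, 2, 3, 3], 36 - 12)   # Example 2: 24 units spare

print("Example 1 (upper bound, feasible):", ex1)
print("Example 2 (upper bound, feasible):", ex2)
```

For Example 1 the feasible count is well below the upper bound of 2,744, but still far larger than the handful of cases left once the lower and upper bounds of the next section are applied.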

Now, we begin our discussion of the significant and simple method of opti¬ 
mization that results when we apply bounds to constrain the enumeration pro¬ 
cess. 


7.8 BOUNDED ENUMERATION APPROACH

7.8.1 Introduction

An analyst is often so charmed by the neatness and beauty of a closed-form 
synthesis process that they overlook the utility of an enumeration procedure. 
Engineering design is inherently a trial-and-error iterative procedure, and sel¬ 
dom are the parameters known so well that an analyst can stand behind a design 
and defend it as the true optimum solution to the problem. In fact, presenting 
a designer with a group of good designs rather than a single one is generally 
preferable since there may be many ancillary issues to consider in making a 
choice. Some of these issues are the following: 

• Sensitivity to variations in the parameters that are only approximately 
known. 

• Risk of success for certain state-of-the-art components presently under 
design or contemplated. 

• Preferences on the part of the customer. (The old cliche about the “Golden 
Rule”—he who has the gold makes the rules—really does apply.) 

• Conflicts between designs that yield high reliability but only moderate 
availability (because of repairability problems), and the converse. 

• Effect of maintenance costs on the chosen solution. 

• Difficulty in mathematically including multiple prioritized constraints 
(some independent multiple constraints are easy to deal with; these are 
discussed below). 

Of course, the main argument against generating a family of designs and
choosing among them is the effort and confusion involved in obtaining such a
family. The predictions of the number of cases needed for direct enumeration in
the two simple examples discussed previously are not encouraging. However,
we will now show that the adoption of some simple lower- and upper-bound
procedures greatly reduces the number of cases that need to be considered and
results in a very practical and useful approach.


7.8.2 Lower Bounds 

The discussion following Eqs. (7.1) and (7.2) pointed out that there is an infi¬ 
nite number of solutions that satisfy these equations. However, once we impose 
the constraint that the individual subsystems are made up of a finite number of 
parallel (hot or cold) systems, the problem becomes integer rather than contin¬ 
uous in nature, and a finite but still large number of solutions exists. Our task 
is to eliminate as many of the infeasible combinations as we can in a manner 
as simple as possible. The lower bounds on the system reliability developed in 
Eqs. (7.11), (7.12) and (7.13) allow us to eliminate a large number of combina¬ 
tions that constitute infeasible solutions. These bounds, powerful though they 
may be, merely state the obvious—that the reliability of a series of independent 
subsystems yields a product of probabilities and, since each probability has an 
upper bound of unity, that each subsystem reliability must equal or exceed the 
system goal. To be practical, it is impossible to achieve a reliability of 1 for 
any subsystem; thus each subsystem reliability must exceed the system goal. 
One can easily apply these bounds by enumeration or by solving a logarithmic 
equation. 

The reliability expression for a chain of k subsystems, where each subsystem
is composed of $n_i$ parallel elements, is given in Eq. (7.3). If we allow all the
subsystems other than subsystem i to have a reliability of unity and we compare
the result with Eq. (7.13), we obtain

$$1 - (1 - r_i)^{n_i} > R_g \tag{7.30}$$

We can easily solve this equation by choosing $n_i$ = 1, 2, ..., substituting and
solving for the smallest value of $n_i$ that satisfies Eq. (7.30). A slightly more
direct method is to solve Eq. (7.30) in closed form as an equality:

$$(1 - r_i)^{n_i} = 1 - R_g \tag{7.31a}$$

Taking the log of both sides of Eq. (7.31a) and solving yields

$$n_i = \frac{\log(1 - R_g)}{\log(1 - r_i)} \tag{7.31b}$$

The solution is the smallest integer that exceeds the value of $n_i$ computed in
Eq. (7.31b).
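
As a quick illustration, Eq. (7.31b) followed by rounding up can be written as a small Python function. The name `min_parallel_units` is hypothetical. The sketch uses the ceiling function, so in the special case where Eq. (7.31b) gives an exact integer it returns that integer (the subsystem then just equals the goal rather than exceeding it).

```python
import math

def min_parallel_units(r, goal):
    """Smallest number of parallel elements of reliability r suggested by Eq. (7.31b)."""
    n = math.log(1.0 - goal) / math.log(1.0 - r)
    return max(1, math.ceil(n))       # at least one element is always present

# Example 1 of Section 7.7.2: goal 0.9, element reliabilities 0.85, 0.5, 0.3
ex1 = [min_parallel_units(r, 0.9) for r in (0.85, 0.5, 0.3)]

# Example 2 of Section 7.7.2: goal 0.95, element reliabilities 0.8, 0.8, 0.8, 0.9, 0.9
ex2 = [min_parallel_units(r, 0.95) for r in (0.8, 0.8, 0.8, 0.9, 0.9)]

print(ex1, ex2)
```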

We now show how these bounds apply to Example 1 of the last section.
Solving Eq. (7.31b) for Example 1 yields

$$n_1 = \frac{\log(1 - R_g)}{\log(1 - r_1)} = \frac{\log(1 - 0.9)}{\log(1 - 0.85)} = 1.21 \tag{7.32a}$$

$$n_2 = \frac{\log(1 - R_g)}{\log(1 - r_2)} = \frac{\log(1 - 0.9)}{\log(1 - 0.5)} = 3.32 \tag{7.32b}$$

$$n_3 = \frac{\log(1 - R_g)}{\log(1 - r_3)} = \frac{\log(1 - 0.9)}{\log(1 - 0.3)} = 6.46 \tag{7.32c}$$

Clearly, the minimum values for $n_1$, $n_2$, and $n_3$ from the preceding computations
are 2, 4, and 7, respectively. Thus, these three simple computations have
advanced our design from the original statement of the problem given in Fig.
7.4(a) to the minimum system design given in Fig. 7.4(b). The subsystem
reliabilities are given by Eq. (7.33):

$$R_i = 1 - (1 - r_i)^{n_i} \tag{7.33}$$

Substitution yields

$$R_1 = 1 - (1 - 0.85)^2 = 0.9775 \tag{7.34a}$$

$$R_2 = 1 - (1 - 0.5)^4 = 0.9375 \tag{7.34b}$$

$$R_3 = 1 - (1 - 0.3)^7 = 0.9176 \tag{7.34c}$$

$$R_s = R_1 R_2 R_3 = 0.9775 \times 0.9375 \times 0.9176 = 0.84089 \tag{7.34d}$$


The minimum system design represents the first step toward achieving an
optimized system design. The reliability has been raised from 0.1275 to 0.8409,
a large step toward achieving the goal of 0.9. Furthermore, only 3 cost units
are left, so 0, 1, 2, and 3 are the number of units that can be added to the
minimum design. An upper bound on the number of cases to be considered is
4 × 4 × 4 = 64 cases, a huge decrease from the initial estimate of 2,744 cases.
(This number, 64, will be further reduced once we add the upper bounds of
Section 7.8.3.) In fact, because this problem is now reduced, we can easily
enumerate exactly how many cases remain. If we allocate the remaining 3
units to subsystem 1, no additional units can be allocated to subsystems 2 and
3 because of the cost constraint. We could label this policy $n_1$ = 5, $n_2$ = 4,
and $n_3$ = 7. However, the minimum design represents such an important initial
step that we will now assume that it is always the first step in optimization and
only deal with increments (deltas) added to the initial design. Thus, instead of
labeling this policy (case) $n_1$ = 5, $n_2$ = 4, and $n_3$ = 7, we will call it $\Delta n_1$ = 3,
$\Delta n_2$ = 0, and $\Delta n_3$ = 0, or incremental policy (3, 0, 0).

[Figure 7.4: series block diagrams of the three subsystems.
(a) Initial Problem Statement: $n_1 = n_2 = n_3 = 1$; $R_1$ = 0.85, $R_2$ = 0.5, $R_3$ = 0.3;
$c_1 = c_2 = c_3 = 1$; $R_{s0}$ = 0.1275; $c_s$ = 3.
(b) Minimum System Design: $R_{sm}$ = 0.8409; $c_s$ = 13.]

Figure 7.4 Initial statement of Example 1 and minimum system design.


We can now apply the same minimum design approach to Example 2. Solving
Eq. (7.31b) for Example 2 yields

$$n_1 = \frac{\log(1 - R_g)}{\log(1 - r_1)} = \frac{\log(1 - 0.95)}{\log(1 - 0.8)} = 1.86 \tag{7.35a}$$

$$n_2 = n_3 = n_1 \tag{7.35b}$$

$$n_4 = \frac{\log(1 - R_g)}{\log(1 - r_4)} = \frac{\log(1 - 0.95)}{\log(1 - 0.9)} = 1.3 \tag{7.35c}$$

$$n_5 = n_4 \tag{7.35d}$$

Clearly, the minimum values for $n_1$, $n_2$, $n_3$, $n_4$, and $n_5$ are all 2. The original
statement of the problem and the minimum system design are given in Fig. 7.5.
The subsystem reliabilities are given by substitution in Eq. (7.33):

$$R_1 = 1 - (1 - 0.8)^2 = 0.96 \tag{7.36a}$$

$$R_2 = R_3 = R_1 \tag{7.36b}$$

$$R_4 = 1 - (1 - 0.9)^2 = 0.99 \tag{7.36c}$$

$$R_5 = R_4 \tag{7.36d}$$

[Figure 7.5: series block diagrams of the five subsystems.
(a) Initial Problem Statement: $n_1$ through $n_5$ = 1; $R_1 = R_2 = R_3$ = 0.8, $R_4 = R_5$ = 0.9;
$c_1 = c_2 = c_3$ = 2, $c_4 = c_5$ = 3; $R_{s0}$ = 0.41472; $c_s$ = 12.
(b) Minimum System Design: $n_1$ through $n_5$ = 2; $R_1 = R_2 = R_3$ = 0.96, $R_4 = R_5$ = 0.99;
$c_1 = c_2 = c_3$ = 4, $c_4 = c_5$ = 6; $R_{sm}$ = 0.8671; $c_s$ = 24.]

Figure 7.5 Initial statement of Example 2 and minimum system design.


Again, the minimum system design is a significant step toward an optimized
system design. The product of the 5 parallel subsystem reliabilities yields a
system reliability of 0.8671, raised from the initial 0.41472. Since
24 cost units are consumed by the minimum design, 12 are left; these will buy
up to 6 redundant elements for each of the first 3 subsystems or up to 4 for each
of the last 2. An upper bound on the number of cases to be considered is
7 × 7 × 7 × 5 × 5 = 8,575 cases, a great reduction from the initial estimate of
177,957, but much larger than the 31 cases that really must be calculated once
upper bounds are added. The next section discusses how we may rapidly find the
remaining cases that need to be enumerated.
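
The minimum-design numbers for both examples can be reproduced with the sketch below. The function `minimum_design` is a hypothetical helper that combines Eq. (7.31b), Eq. (7.33), and the simple per-subsystem case count used in the text (remaining budget divided by unit cost, plus one).

```python
import math

def minimum_design(r, costs, goal, budget):
    """Return (n, system reliability, cost, budget left, upper bound on cases)."""
    # Lower bound on parallel elements per subsystem, Eq. (7.31b) rounded up.
    n = [math.ceil(math.log(1 - goal) / math.log(1 - ri)) for ri in r]
    # Series product of parallel-subsystem reliabilities, Eq. (7.33).
    rs = math.prod(1 - (1 - ri) ** ni for ri, ni in zip(r, n))
    cost = sum(ni * ci for ni, ci in zip(n, costs))
    left = budget - cost
    # Each subsystem could receive 0 .. left // cost extra units if taken alone.
    cases = math.prod(left // ci + 1 for ci in costs)
    return n, rs, cost, left, cases

ex1 = minimum_design([0.85, 0.5, 0.3], [1, 1, 1], 0.9, 16)
ex2 = minimum_design([0.8, 0.8, 0.8, 0.9, 0.9], [2, 2, 2, 3, 3], 0.95, 36)

for name, (n, rs, cost, left, cases) in (("Example 1", ex1), ("Example 2", ex2)):
    print(f"{name}: n = {n}, Rs = {rs:.4f}, cost = {cost}, left = {left}, cases <= {cases}")
```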
7.8.3 Upper Bounds

In the previous section, we showed that the lower-bound procedure greatly 
decreased the number of cases that must be evaluated if enumeration is to be 
used. In this section, we show that this number is further reduced by the impo¬ 
sition of upper bounds that are related to the resource constraint, which has 
been modeled as cost. However, the procedure would be the same if a different
single constraint (such as volume or weight) were involved. The case of
multiple constraints is discussed later. 

We begin by discussing the rare case in which the lower bound that yields
the minimum design meets or exceeds the system goal. For example, suppose
that a system reliability $R_s > R_g$ can be achieved by expending 90% of the cost,
$c_0$. In such a case, we wish to ask how much better we can make the system
if we spend a bit more and how much we save if we are content to accept
a slightly smaller $R_s$ that does not exceed $R_g$. The easiest way to formulate
such a set of policies is to compute the resultant system reliabilities and costs
for adding or deleting one parallel element for subsystem $x_1$, repeating the
procedure for subsystem $x_2$, and so on. The design team and customer then
examine this family of policies to determine which policy is to be pursued.

In the more familiar case, $R_s < R_g$, and we wish to expend all or some of
the remaining resource $c_a$ to improve the system reliability to meet the desired
goal. We seek a more efficient procedure for achieving an optimum system than
blind enumeration. We can use the minimum solution as a lower bound and
add in the resource constraint to achieve an upper bound on the solution. The
resource constraint forms upper bounds on the number of additional elements
that can be allocated. For Example 1, the minimum system leaves 3 cost units,
which allow up to 3 additional parallel elements for each subsystem. However,
each time we allocate a unit to a subsystem, we expend 1 resource unit, and
the number of units available to other subsystems is also reduced by 1. We
call the allocation of these additional resources the augmentation policy; it is
the addition of the best of the augmentation policies to the minimum design
that results in the optimization policy.

The use of a branching search tree (policy tree) is one way to illustrate
how the upper bounds are computed and how they constrain the solution to
a small number of cases. We start with the minimum system design that is
the result of the lower bounds and use the remaining constraint to generate
a set of augmentation policies from which we select the optimum policy or,
as discussed previously, one of the suboptima close to the true optimum. We
use Example 1 to illustrate the procedure. The minimum design absorbs 13
cost units, leaving 3 additional units. Thus the number of redundant elements
is bounded from below by the minimum system design and from above by at
most 3 additional units, yielding $2 \le n_1 \le 5$; $4 \le n_2 \le 7$; and $7 \le n_3 \le 10$. One
can improve on these bounds by applying the upper bounds as computation
of the cases proceeds. The easiest way to accomplish this is to compute the
minimum design and allocate the remaining resource in an incremental manner.

The incremental policy is implemented in the following manner. Consider
the alternatives; one could expend all the 3 incremental units on element 1, for
instance, resulting in an augmentation policy $\Delta n_1$ = 3, $\Delta n_2$ = 0, and $\Delta n_3$ = 0, or
one could use incremental notation, resulting in an incremental policy (3, 0, 0),
with c = 16. Computation of this policy's reliability yields R = 0.8602. Another
alternative for element 1 is (2, -), with c = 15. One cost unit would be left to
spend on element 2 or element 3. These two policies would be denoted as (2, 1,
0) and (2, 0, 1) using the incremental notation and yield reliabilities of 0.8885
and 0.8830. These three policies discussed above, as well as the seven other
possible ones, are shown in the search tree of Fig. 7.6. Branches radiating from the
start node at the top of the tree represent the number of additional components
(in addition to the minimum solution) assigned to element 1. The second level
displays the incremental choices for element 2, and the third level displays the
incremental choices for element 3. Inspection of Fig. 7.6 shows that the maximum
reliability occurs for augmentation policy (1, 1, 1), the center path in the
diagram, which corresponds to a total solution of $n_1$ = 2 + 1, $n_2$ = 4 + 1, and
$n_3$ = 7 + 1 at a cost of 16 and a reliability of 0.9098. Of course, the other solutions
denoted by augmentation policies (0, 2, 1) and (0, 1, 2) with reliabilities 0.9068
and 0.9087 are very close; one of these could be chosen as the policy of choice
based on other factors. Other possibilities are to use standby reliability for some
of the systems, especially in the case of element 3, which has a large number of
parallel units. In some cases, we may not be able to reach the system goal, and
either the goal must be relaxed or the cost budget must be increased.
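
The branching structure of the policy tree can be sketched as a short recursive generator. The function name `augmentation_policies` is hypothetical, and the sketch assumes unit costs (as in Example 1) and that every policy spends the whole augmentation budget, so the last subsystem simply absorbs whatever is left.

```python
def augmentation_policies(spare, subsystems):
    """Yield every tuple of extra units that spends exactly `spare` unit-cost units."""
    if subsystems == 1:
        yield (spare,)                       # the last subsystem absorbs what is left
        return
    for extra in range(spare + 1):           # one branch per choice for this subsystem
        for rest in augmentation_policies(spare - extra, subsystems - 1):
            yield (extra,) + rest

policies = list(augmentation_policies(3, 3))   # Example 1: 3 spare units, 3 subsystems
for p in policies:
    print(p)
print(len(policies), "policies")
```

Each level of recursion corresponds to one level of the tree: the outer loop picks the increment for element 1, the next call picks the increment for element 2, and the base case fixes element 3.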

7.8.4 An Algorithm for Generating Augmentation Policies 

The basic approach is simple: the lower-bound solution for the minimum system
design is the starting point. The resources for the minimum system design
are subtracted from the resource budget to obtain the resources available for
augmentation. All possible augmentation policies are generated for element 1,
along with the concomitant reduction in augmentation resources. For each of
the policies for element 1, the remaining augmentation resources are used for
element 2 to form the second step of the policies. This process is continued
for the rest of the elements. Since the augmentation resources quickly decrease
to 0 for many of the policies, the number of combinations to be considered is
greatly reduced as the process continues. Once an augmentation policy is
completed, the reliability is calculated and the information is listed in a table (or on
a search tree for smaller problems). A choice is then made among the policies
yielding high reliabilities.

TABLE 7.4 An Algorithm that Solves for the Minimum System Design and the
Augmentation Policies for Example 1

procedure (optimum reliability policy computation)
   {Three subsystems: 1, 2, 3}
   {Reliability of subsystems R1, R2, R3}
   {Cost Constraint, C}
   {Reliability Goal, RG}
   input (R1, R2, R3, C, RG)
   begin {Minimum System Design}
      M1 := ceiling [log(1 - RG)/log(1 - R1)]
      M2 := ceiling [log(1 - RG)/log(1 - R2)]
      M3 := ceiling [log(1 - RG)/log(1 - R3)]
      RS := [(1 - (1 - R1)**M1)] * [(1 - (1 - R2)**M2)] * [(1 - (1 - R3)**M3)]
      PRINT (M1, M2, M3, RS)
   end {Minimum System}
   begin {Augmentation Policy}
      CA := C - M1 - M2 - M3
      for I := 0 to CA
         for J := 0 to CA - I
            K := CA - I - J   {all remaining units go to subsystem 3}
            N1 := M1 + I
            N2 := M2 + J
            N3 := M3 + K
            RS := [(1 - (1 - R1)**N1)] * [(1 - (1 - R2)**N2)] * [(1 - (1 - R3)**N3)]
            PRINT (N1, N2, N3, RS)
         end J
      end I
   end {Augmentation}
end {Procedure}

Note: The control statements have their usual meanings: The assignment operator is denoted by
:= and the ceiling (x) function is the smallest integer that is greater than or equal to x. The symbol
** means "raise to the power of" and * means "multiplied by."

A simple algorithm is given in Table 7.4 that solves for the minimum system
design and the augmentation policies for Example 1 of Section 7.7.2. The
algorithm is written in a pseudocode form similar to that given in Appendix 2
of Rosen [1999]. It generates the minimum system design (2, 4, 7) and the 10
augmentation policies. In the augmentation policies, the I-loop allocates 0 to
3 resource units to subsystem 1, and the J-loop allocates from 0 up to the
(3 - I) remaining units to subsystem 2. Once resources are allocated to
subsystems 1 and 2, the amount of resources left for subsystem 3, K, is
(3 - I - J). Thus, the I-loop takes on the values I = 0, 1, 2, 3. The J-loop
takes on the values J = 0, 1, ..., (3 - I), which generates the pairs
[I, J = (0, 0), (0, 1), (0, 2), (0, 3), (1, 0), (1, 1), (1, 2), (2, 0), (2, 1),
(3, 0)]. Lastly, the K-loop assigns the remaining variable K = [3 - (I + J)],
which completes the 10 triplets [I, J, K = (0, 0, 3), (0, 1, 2), (0, 2, 1),
(0, 3, 0), (1, 0, 2), (1, 1, 1), (1, 2, 0), (2, 0, 1), (2, 1, 0), (3, 0, 0)].

TABLE 7.5 Results of the Algorithm for Computing the
Augmentation Policies for Example 1 (Optimum Solutions)

Minimum System Design

   M1    M2    M3    RS
    2     4     7    0.8409362

Optimum Policies = Minimum Design + Augmentation Policies

   N1    N2    N3    RS
    2     4    10    0.8905201
    2     5     9    0.9087401
    2     6     8    0.9067561
    2     7     7    0.8899909
    3     4     9    0.896632
    3     5     8    0.9098224
    3     6     7    0.9002588
    4     4     8    0.8830077
    4     5     7    0.8885192
    5     4     7    0.860227
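The enumeration described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the program the author ran: the function and variable names are assumptions, and the data (subsystem reliabilities 0.85, 0.5, 0.3, a goal of 0.9, a budget of 16, and unit cost per element) are taken from Example 1 as used elsewhere in this chapter.

```python
from math import ceil, log

def parallel_reliability(r, n):
    """Reliability of n identical units in parallel, each with reliability r."""
    return 1.0 - (1.0 - r) ** n

def system_reliability(r_list, n_list):
    """Series system of parallel subsystems."""
    rs = 1.0
    for r, n in zip(r_list, n_list):
        rs *= parallel_reliability(r, n)
    return rs

R = [0.85, 0.5, 0.3]   # subsystem reliabilities, Example 1
RG = 0.9               # system reliability goal
BUDGET = 16            # cost budget, one cost unit per element

# Minimum system design: each subsystem alone must reach RG.
M = [ceil(log(1 - RG) / log(1 - r)) for r in R]
CA = BUDGET - sum(M)   # resources left for augmentation

# I-loop and J-loop; K takes whatever is left, so the budget is fully spent.
policies = []
for i in range(CA + 1):
    for j in range(CA - i + 1):
        k = CA - i - j
        n = (M[0] + i, M[1] + j, M[2] + k)
        policies.append((n, system_reliability(R, n)))

best_n, best_rs = max(policies, key=lambda p: p[1])
print("minimum design:", M, round(system_reliability(R, M), 7))
for n, rs in policies:
    print(n, round(rs, 7))
print("best:", best_n, round(best_rs, 7))
```

Run as written, the sketch lists ten policies, and the largest reliability belongs to (3, 5, 8), in agreement with Table 7.5.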

Execution of a program based on the algorithm in Table 7.4 enumerates the 
10 augmentation policies discussed previously. The results, given in Table 7.5, 
agree with the search tree in Fig. 7.6. 

The concept of an optimum is clearly defined mathematically, but in terms
of an engineering design, a family of near-optimum solutions is preferred along
with the optimum one. Since we have a simple algorithm (program), it is easy
for the designer to explore such a family of solutions. Suppose we ask what
reliability could be achieved if one decided to shave the cost from 16 units to 15
units. Substituting an augmentation budget of 2 instead of 3 in the loops given
in Table 7.4 yields the solutions in Table 7.6. Now our comparison begins:
Is the solution of (2, 5, 8), with a reliability of 0.8923632 and a cost of 15, a
good substitute for the solution of (3, 5, 8), with a reliability of 0.9098224 and
a cost of 16? Suppose we can afford a budget of 17 units. The optimum, with
an increased budget of 17, achieves a reliability of 0.9265198 by using policy
(3, 5, 9). Is it worth the extra cost unit to raise the reliability? Only a design
review that includes the system designer and the customer, and that weighs the
relevant practical considerations, can decide such issues.

TABLE 7.6 Reliability of Various Designs for Example 1 (Optimum
Solutions) In Which the Maximum Cost Is Reduced to 15 Units

Minimum System Design

   M1    M2    M3    RS
    2     4     7    0.8409362

Optimum Policies = Minimum Design + Augmentation Policies

   N1    N2    N3    RS
    2     4     9    0.8794260
    2     5     8    0.8923632
    2     6     7    0.8829831
    3     4     8    0.8804733
    3     5     7    0.8859690
    4     4     7    0.8598573
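The budget comparison above is easy to automate. The following sketch, under the same Example 1 assumptions (unit cost per element, minimum design (2, 4, 7)), finds the best fully spent design for budgets of 15, 16, and 17 units; the helper names are hypothetical.

```python
from itertools import product
from math import prod

R = (0.85, 0.5, 0.3)   # subsystem reliabilities, Example 1
M = (2, 4, 7)          # minimum system design (cost 13)

def reliability(n):
    """Series system of parallel subsystems with n[i] units each."""
    return prod(1.0 - (1.0 - r) ** k for r, k in zip(R, n))

def best_policy(budget):
    """Best design that spends the whole budget (one cost unit per element)."""
    extra = budget - sum(M)
    candidates = [
        tuple(m + a for m, a in zip(M, aug))
        for aug in product(range(extra + 1), repeat=len(M))
        if sum(aug) == extra
    ]
    return max(candidates, key=reliability)

results = {b: best_policy(b) for b in (15, 16, 17)}
for budget, n in results.items():
    print(budget, n, round(reliability(n), 7))
```

The three designs it reports are the ones quoted in the text, which gives the designer a quick reliability-versus-cost curve to bring to the design review.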

We repeat the procedure in Table 7.6 for Example 2. An algorithm for the
solution of Example 2 is given in Table 7.7. Note in the table that the
loop-control end point is adjusted for the resources allocated in outer loops
[e.g., for L := 0 to (CA - C1 * I - C2 * J - C3 * K)/C4], and note that the end
point is divided by the item weight, C4, because each unit added by L consumes
C4 resource units.

TABLE 7.7 An Algorithm That Solves for the Minimum System Design and the
Augmentation Policies for Example 2

procedure (optimum reliability policy computation)
   {Five subsystems: 1, 2, 3, 4, 5}
   {Reliability of subsystems R1, R2, R3, R4, R5}
   {Cost Constraint, C: individual costs, C1, C2, C3, C4, C5}
   {Reliability Goal, RG}
   input (R1, R2, R3, R4, R5, C, RG, C1, C2, C3, C4, C5)
   begin {Minimum System Design}
      M1 := ceiling [log(1 - RG)/log(1 - R1)]
      M2 := ceiling [log(1 - RG)/log(1 - R2)]
      M3 := ceiling [log(1 - RG)/log(1 - R3)]
      M4 := ceiling [log(1 - RG)/log(1 - R4)]
      M5 := ceiling [log(1 - RG)/log(1 - R5)]
      RS := [(1 - (1 - R1)**M1)] * [(1 - (1 - R2)**M2)] * [(1 - (1 - R3)**M3)]
            * [(1 - (1 - R4)**M4)] * [(1 - (1 - R5)**M5)]
      CS := M1 * C1 + M2 * C2 + M3 * C3 + M4 * C4 + M5 * C5
      PRINT (M1, M2, M3, M4, M5, CS, RS)
   end {Minimum System}
   begin {Augmentation Policy}
      CA := C - CS
      for I := 0 to CA/C1
         for J := 0 to (CA - C1 * I)/C2
            for K := 0 to (CA - C1 * I - C2 * J)/C3
               for L := 0 to (CA - C1 * I - C2 * J - C3 * K)/C4
                  for M := 0 to (CA - C1 * I - C2 * J - C3 * K - C4 * L)/C5
                     N1 := M1 + I
                     N2 := M2 + J
                     N3 := M3 + K
                     N4 := M4 + L
                     N5 := M5 + M
                     RS := [(1 - (1 - R1)**N1)] * [(1 - (1 - R2)**N2)]
                           * [(1 - (1 - R3)**N3)] * [(1 - (1 - R4)**N4)]
                           * [(1 - (1 - R5)**N5)]
                     CS := C1 * N1 + C2 * N2 + C3 * N3 + C4 * N4 + C5 * N5
                     PRINT (N1, N2, N3, N4, N5, CS, RS)
                  end M
               end L
            end K
         end J
      end I
   end {Augmentation}
end {Procedure}

The results of executing a program corresponding to the algorithm of Table
7.7 are given in Table 7.8. (The running time of the program was one or two
seconds on a Pentium 400 MHz personal computer.) The minimum system
design requires 2 elements for each subsystem, expends 24 units of resources,
and achieves a reliability of 0.8671. The 31 policies generated by the algorithm
that satisfy RS >= 0.95 are listed in Table 7.8 in descending order of
reliability.

Note that some of the policies are dominated by other policies; for example,
policy 14, which is (4, 5, 3, 2, 2) and requires 36 resource units, is dominated
by policy 1, which is (4, 4, 4, 2, 2) and also uses the entire 36 resource units to
achieve a higher reliability. A study of the table shows that policy 1 dominates
policies 2, 9, 10, 11, 12, 13, 14, 18, 19, 22, 23, and 24. Similarly, policy 15
dominates policies 25, 26, and 27. Thus out of the 31 policies, a total of 15 are
dominated, leaving 16 to represent “good solutions” that should be considered.

Further inspection of Table 7.8 shows that there are many good policies that
yield suboptima. For example, policy 31 satisfies the minimum requirement RS
> 0.95 with fewer resources: 30 rather than the 36 budgeted. Such calculations
should result in a conference between the design leader, the management, and
the customer to answer such questions as the following:


1. Is a reliability of 0.9754 that uses resources of 36 units significantly better
than one of 0.95677 that uses resources of 30 units?

2. What would be the cost reduction for a system that uses 30 resource
units?

3. If 30 resource units are used for reliability purposes, can the additional
budgeted 6 resource units be used for something else of value in the
system design?

To further answer these questions, additional studies should be attempted with
perhaps 32 or 34 resource units, and their results should be included in the
comparison. The major result demonstrated in this section is that the use of
upper and lower bounds and modern, relatively fast personal computers allows a
designer the luxury of computing a range of design solutions and comparing them;
in general, the more complex methods discussed later in this chapter are seldom
needed. Using the results of the discussion of availability in Section 7.4.6, it
is easy to adapt any of the foregoing algorithms to availability apportionment
by substituting probabilities in the algorithms that represent unit availabilities
rather than reliabilities.

TABLE 7.8 Parallel Redundancy Optimum and Suboptimum Solutions for
Example 2

   Rank    n1    n2    n3    n4    n5    Cost    Reliability
     1      4     4     4     2     2     36      0.97540
     2      3     3     3     3     3     36      0.97424
     3      3     3     4     2     3     35      0.97169
     4      3     3     4     3     2     35      0.97169
     5      3     4     3     2     3     35      0.97169
     6      3     4     3     3     2     35      0.97169
     7      4     3     3     2     3     35      0.97169
     8      4     3     3     3     2     35      0.97169
     9      3     4     5     2     2     36      0.97039
    10      3     5     4     2     2     36      0.97039
    11      4     3     5     2     2     36      0.97039
    12      5     3     4     2     2     36      0.97039
    13      5     4     3     2     2     36      0.97039
    14      4     5     3     2     2     36      0.97039
    15      4     3     4     2     2     34      0.96915
    16      3     4     4     2     2     34      0.96915
    17      4     4     3     2     2     34      0.96915
    18      3     3     3     4     2     36      0.96633
    19      3     3     3     2     4     36      0.96633
    20      3     3     3     3     2     33      0.96546
    21      3     3     3     2     3     33      0.96546
    22      3     3     6     2     2     36      0.96442
    23      6     3     3     2     2     36      0.96442
    24      3     6     3     2     2     36      0.96442
    25      5     3     3     2     2     34      0.96417
    26      3     3     5     2     2     34      0.96417
    27      3     5     3     2     2     34      0.96417
    28      4     3     3     2     2     32      0.96294
    29      3     3     4     2     2     32      0.96294
    30      3     4     3     2     2     32      0.96294
    31      3     3     3     2     2     30      0.95677
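The Example 2 enumeration can be sketched as follows. This is not the author's program: instead of tightening each loop bound in turn, the sketch bounds every subsystem by the augmentation budget and then discards designs that exceed the total cost. It also keeps only designs with RS >= RG, an assumption suggested by the 31 rows of Table 7.8. All names are hypothetical.

```python
from itertools import product
from math import ceil, log, prod

R = (0.8, 0.8, 0.8, 0.9, 0.9)   # subsystem reliabilities, Example 2
COST = (2, 2, 2, 3, 3)          # cost per element in each subsystem
BUDGET = 36
RG = 0.95                       # system reliability goal

def reliability(n):
    return prod(1.0 - (1.0 - r) ** k for r, k in zip(R, n))

def cost(n):
    return sum(c * k for c, k in zip(COST, n))

# Lower bound: every subsystem must reach RG on its own.
M = tuple(ceil(log(1 - RG) / log(1 - r)) for r in R)
CA = BUDGET - cost(M)           # resources left for augmentation

# Upper bound on extra units per subsystem, then prune by total cost.
ranges = [range(CA // c + 1) for c in COST]
policies = []
for aug in product(*ranges):
    n = tuple(m + a for m, a in zip(M, aug))
    if cost(n) <= BUDGET and reliability(n) >= RG:
        policies.append((reliability(n), cost(n), n))

policies.sort(reverse=True)     # descending reliability
for rank, (rs, c, n) in enumerate(policies, start=1):
    print(rank, n, c, round(rs, 5))
```

Policies with equal reliability may appear in a different order than in Table 7.8, since ties are broken here by cost and then by the design tuple.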


7.8.5 Optimization with Multiple Constraints 

The preceding material in this chapter has dealt with a single constraint and has 
used cost to illustrate the constraint. Sometimes, however, there are multiple 
constraints—cost, volume, and weight, for instance. Of obvious importance is 
the use of these three constraints in satellites, spacecraft, and aircraft. Without 
loss of generality, we can assume that there are three constraints: cost, volume, 
and weight (c, v, and w) and that the constraints (given in the forthcoming 
equations) are similar to those of Eq. (7.4). Clearly, these constraints as well 
as the following equations can represent other variables and/or can be extended 
to more than three constraints. 


    c = \sum_{i=1}^{k} n_i c_i          (7.37a)

    v = \sum_{i=1}^{k} n_i v_i          (7.37b)

    w = \sum_{i=1}^{k} n_i w_i          (7.37c)

Generally, optimization techniques such as dynamic programming become
much more difficult when more than one variable is involved. Since we are
dealing with discrete optimization and enumeration, however, the extra work
of multiple constraints is modest. First of all, the computation of the minimum
system design is not affected by the constraints; thus lower bounds are
computed as in the case of a single constraint (cf. Section 7.6.2). Once the minimum
design (lower bound) is obtained, the values are substituted in Eqs. (7.37a-c)
and the remaining values of the constraints are computed for the augmentation
phase. In some cases, the minimum system design exceeds one or more
of the constraints, and the reliability goal and constraints are incompatible (one
can call such a situation an ill-formulated problem). The only recourse in the
case of an ill-formulated problem is to have a high-level design review with
all members of the designer’s and the customer’s senior management present
to change the requirements so that the problem is solvable.

Assume that the minimum system design still leaves some of each constrained
resource for the augmentation phase. The constraints are still used to
compute the upper bounds; however, we now have more than one upper bound. For
the case under discussion, we have three upper bounds (one governed by cost,
one by weight, and one by volume) that result in three values of n: (n_i, n_i', n_i'').
The bound we choose is the minimum value of the three bounds, that is, the
minimum value of (n_i, n_i', n_i'') [Rice, 1999]. Computation of the augmentation
policy proceeds in the same manner as discussed in Section 7.6.3; however,
at each stage, three upper bounds are computed, and the minimum is used in
each case. Once the augmentation policy is obtained, the system reliability is
computed. If the reliability goal cannot be obtained, a high-level design review
is enacted. The designer should compute beforehand some alternative designs
for presentation that violate one or more of the constraints but still achieve the
goals.
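A minimal sketch of the "minimum of the upper bounds" rule follows. The resource figures are hypothetical and chosen only for illustration; they do not come from the text. The sketch assumes integer resource quantities and a positive usage per unit for every constraint.

```python
def upper_bound(remaining, unit_usage):
    """Largest number of extra units of one subsystem that fits every constraint.

    remaining  -- resource left for augmentation, per constraint
    unit_usage -- resource consumed by one unit of the subsystem, per constraint
    """
    return min(remaining[name] // unit_usage[name] for name in remaining)

# Hypothetical figures for a single subsystem.
remaining = {"cost": 12, "volume": 20, "weight": 9}
unit_usage = {"cost": 2, "volume": 5, "weight": 2}

bound = upper_bound(remaining, unit_usage)
print(bound)
```

Here cost alone would allow 6 extra units, but volume and weight each allow only 4, so the loop for this subsystem would run from 0 to 4.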


7.9 APPORTIONMENT AS AN APPROXIMATE OPTIMIZATION 
TECHNIQUE 

The bounded solutions of the previous section led to a family of solutions that 
included the maximum reliability combination. The apportionment techniques 
discussed in Section 7.6 can be viewed as an approximate solution. In this 
section, we explore how these solutions compare with the optimum solutions 
of the previous section. 

We begin by considering Example 1 given in Section 7.7.2. The system goal
is R_g = 0.9, and by using Eq. (7.15b) we obtain a goal of 0.9655 for each of the
three subsystems. We can determine how many parallel components are needed
for each subsystem if we use the reliability goal of 0.9655 and each subsystem
reliability (0.85, 0.5, 0.3) and substitute into Eq. (7.31b). The results for the
subsystems are

    n_i = \log(1 - R_g) / \log(1 - r_i)                      (7.38a)

    n_1 = \log(1 - 0.9655) / \log(1 - 0.85) = 1.77           (7.38b)

    n_2 = \log(1 - 0.9655) / \log(1 - 0.5) = 4.86            (7.38c)

    n_3 = \log(1 - 0.9655) / \log(1 - 0.3) = 9.44            (7.38d)

Rounding up, this represents a solution of n_1, n_2, n_3 = 2, 5, 10, using 17
cost units and exceeding both the reliability goal and the cost budget. One
could then try removing one unit from one subsystem at a time to arrive at
approximate solutions. If we remove one unit from subsystem 2 or 3, we obtain
the 16-unit solutions n_1, n_2, n_3 = 2, 4, 10 and n_1, n_2, n_3 = 2, 5, 9, which
correspond to the optimum designs given in Table 7.5 (rows 1 and 2). If we
remove one unit from subsystem 1, we obtain solution n_1, n_2, n_3 = 1, 5, 10,
corresponding to a reliability of 0.8002 that is clearly inferior to all the 10
solutions in Table 7.5.

We now consider the apportionment solutions for Example 2 given in Section
7.7.2. The system goal is R_g = 0.95, and by using Eq. (7.15b), we obtain
a goal of 0.9898 for each of the five subsystems. We can determine how many
parallel components are needed for each subsystem to meet the reliability goal
of 0.9898 with each subsystem reliability (0.8, 0.8, 0.8, 0.9, 0.9). Substitution
into Eq. (7.31b) yields the desired results that need to be calculated only for
the subsystem values of 0.8 and 0.9; these are

    n_i = \log(1 - R_g) / \log(1 - r_i)                              (7.39a)

    n_1 = n_2 = n_3 = \log(1 - 0.9898) / \log(1 - 0.8) = 2.85        (7.39b)

    n_4 = n_5 = \log(1 - 0.9898) / \log(1 - 0.9) = 1.99              (7.39c)

Thus rounding up represents a solution of n_1, n_2, n_3, n_4, n_5 = 3, 3, 3, 2, 2.
Since the costs are 2, 2, 2, 3, 3 units per subsystem, respectively, this
apportioned solution expends 30 cost units and yields a reliability of 0.9568. This
is the last policy in Table 7.8. We conclude that the simplest equal apportionment
method is a good approximation, and since it only takes a few minutes
with paper, pencil, and calculator, it is a valuable check on the results of the
previous section.
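The equal apportionment calculation for both examples can be checked with a short sketch. It assumes that Eq. (7.15b) assigns each of the k subsystems the k-th root of the system goal, which reproduces the quoted subsystem goals of 0.9655 and 0.9898; the function names are hypothetical.

```python
from math import ceil, log, prod

def equal_apportionment(r_list, system_goal):
    """Units per subsystem when every subsystem gets the same reliability goal."""
    k = len(r_list)
    sub_goal = system_goal ** (1.0 / k)   # assumed form of Eq. (7.15b)
    n = [ceil(log(1 - sub_goal) / log(1 - r)) for r in r_list]
    return sub_goal, n

def reliability(r_list, n_list):
    return prod(1.0 - (1.0 - r) ** n for r, n in zip(r_list, n_list))

# Example 1
goal1, n1 = equal_apportionment([0.85, 0.5, 0.3], 0.9)

# Example 2
r2 = [0.8, 0.8, 0.8, 0.9, 0.9]
goal2, n2 = equal_apportionment(r2, 0.95)
cost2 = sum(c * n for c, n in zip([2, 2, 2, 3, 3], n2))

print(round(goal1, 4), n1)
print(round(goal2, 4), n2, cost2, round(reliability(r2, n2), 4))
```

The rounded-up designs are (2, 5, 10) for Example 1 and (3, 3, 3, 2, 2) for Example 2, matching the hand calculation above.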


7.10 STANDBY SYSTEM OPTIMIZATION 

In principle, the optimization of a standby system is the same procedure as
that of a parallel system, but instead of the reliability expression for n items in
parallel given in Eq. (7.3), the expression given in Eqs. (7.5) and (7.6) is used.
Because Eq. (7.6) is a series, the simple solution for the number of elements in
the minimum system design given in Eqs. (7.30) and (7.31) is not applicable.
A slightly more complicated solution for a standby system involves the evaluation
of Eq. (7.6) for increasing values of k until the right-hand side exceeds the
reliability goal R_g. This is a little more complicated for paper-pencil-and-calculator
computation, but the complexity increase is insignificant when a computer
program is used. Another approach is to use cumulative Poisson tables
or graphs to solve Eq. (7.6), the Chebyshev bound, or the normal approximation
to the Poisson (these latter techniques are explained and compared in
Messinger [1970]). The reader should note that the techniques for a standby
system also apply to a system with spares. We assume that a standby system
switches in the standby component quickly enough for the system performance
to not be affected. In other systems, a set of spare components is kept near an
operating system that has self-diagnostics to rapidly signal the failure. If the
replacement of a spare is rapid (e.g., opening a panel and putting in a printed
circuit board spare), the system would not be down for a significant interval
(this is essentially a standby system). If the time to switch a standby system is
long enough to interrupt system performance or if the downtime in replacing
a spare is significant, we must treat the system as an availability problem and
formulate a Markov model.

One can use Example 1 of Section 7.7.2 to illustrate a standby system
reliability optimization [Shooman, 1994]. Since standby systems are generally
more complex than parallel ones because of the failure detection and switching
involved, we assume that each element costs 1.5 units rather than 1 unit.
Furthermore, we equate the probability of no failures (x = 0) for the Poisson,
e^{-\mu}, to the reliabilities of each unit and solve for the expected number of
failures, \mu: \mu_1 = -ln(0.85) = 0.1625189; \mu_2 = -ln(0.5) = 0.6931471; and
\mu_3 = -ln(0.3) = 1.2039728. Substitution into Eqs. (7.5) and (7.6) yields

    e^{-\mu_1} (1 + \mu_1 + \mu_1^2/2! + \cdots) > 0.90
    for two terms: 0.85(1 + 0.1625189) = 0.9881 > 0.9                          (7.40a)

    e^{-\mu_2} (1 + \mu_2 + \mu_2^2/2! + \cdots) > 0.90
    for three terms: 0.5(1 + 0.6931471 + 0.2402264) = 0.9667 > 0.9             (7.40b)

    e^{-\mu_3} (1 + \mu_3 + \mu_3^2/2! + \cdots) > 0.90
    for four terms: 0.3(1 + 1.2039728 + 0.7247752 + 0.2908735) = 0.9659 > 0.9  (7.40c)

Thus, the minimum values for standby system reliability are n_1 = 2, n_2 =
3, n_3 = 4. The minimum system cost is 9 x 1.5 = 13.5 and the reliability is
0.9881 x 0.9667 x 0.9659 = 0.9226. Since this exceeds the reliability goal,
this is the optimum solution, and the augmentation policy phase is not needed.
If augmentation had been required, we could use an algorithm similar to that
of Table 7.4; however, instead of equations for M1, M2, and M3, do while
R < R_g loops that increment n are used. Similarly, the RS := equation becomes a
product of the series expansions for the Poisson that are computed by for loops
in a manner similar to that of Eqs. (7.40a-c). If the assumed cost of 1.5 for
a standby element is accurate, and if there are no other overriding factors, the
standby system would be preferred to that of any of the parallel system policies
of Table 7.5 since the resource cost is less and the reliability is higher.
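The "do while R < R_g" loop for the standby minimum design can be sketched as below. The sketch assumes the Poisson partial-sum form used in Eqs. (7.40a-c), with n units meaning one on-line unit plus n - 1 standbys; the function names are hypothetical.

```python
from math import exp, factorial, log

def standby_reliability(mu, n):
    """Poisson probability of at most n - 1 failures (n units, one on line)."""
    return exp(-mu) * sum(mu ** x / factorial(x) for x in range(n))

def min_standby_units(r, goal):
    """Smallest n whose standby reliability reaches the goal."""
    mu = -log(r)                 # from exp(-mu) = r
    n = 1
    while standby_reliability(mu, n) < goal:
        n += 1
    return n

R = [0.85, 0.5, 0.3]   # unit reliabilities, Example 1
RG = 0.9               # reliability goal
UNIT_COST = 1.5        # assumed cost of a standby element

n = [min_standby_units(r, RG) for r in R]
rs = 1.0
for r, k in zip(R, n):
    rs *= standby_reliability(-log(r), k)
cost = UNIT_COST * sum(n)
print(n, cost, round(rs, 4))
```

The loop stops at 2, 3, and 4 units for the three subsystems, for a cost of 13.5 and a system reliability of about 0.9226, as computed by hand above.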

It is also possible to adapt the foregoing optimization techniques to an r- 
out-of-n system design; see Shooman [1994, p. 946]. 

One should not forget that reliability can be improved by component
improvement as an alternative to parallel or standby system redundancy. In
general, there are extra costs involved (development and production), and
typically such an improved design begins by listing all the ways in which the
element can fail in the order of frequency. Design engineers then propose
schemes to eliminate, mitigate, or reduce the frequency of occurrence. The
design changes are made, sometimes one at a time or a few at a time, and the
prototype is tested to confirm the success of the redesign. Sometimes overstress
(accelerated) testing is used in such a process to demonstrate unknown failure
modes that must be fixed. A model comparing the costs of improved design
with the costs of parallel redundancy is given in Shooman and Marshall [1993
and 1994, p. 947].


7.11 OPTIMIZATION USING A GREEDY ALGORITHM 

7.11.1 Introduction 

If one studies the optimum solutions of various optimization procedures, one
finds that the allocation of parallel subsystems tends to raise all the subsystems
to nearly the same reliability. For example, consider the optimum solutions
given in Table 7.5. The subsystem reliabilities start out as (0.85, 0.5, 0.3); the
minimal system design (with 2, 4, and 7 parallel systems) yields, upon
substitution into Eq. (7.3), subsystem reliabilities of (0.9775, 0.9375, 0.9176); and
the optimum solution of 3, 5, 8 results in reliabilities of (0.9966, 0.9688, 0.9424).
This is one of the reasons that the equal apportionment approximation gave
reasonably good results, leading one to a heuristic procedure for optimization.
If one starts with the initial design or, better, with the minimal system design,
one can allocate additional parallel systems to the subsystems by computing
which allocation to subsystem 1, 2, or 3 will produce the largest increase in
reliability. Such an allocation is made and the computations are repeated with
the new parallel system added. Based on these new computations, a second
parallel system is added, and the procedure is repeated until the entire resource
has been expended. This procedure generates an augmentation policy that, when
added to the minimal system design, generates a good policy.

7.11.2 Greedy Algorithm 

The foregoing algorithm, which makes the locally best choice at each step, is
often referred to as a “greedy” algorithm [Cormen, Chapter 17]. Starting with the
minimum system design, we compute the increase in reliability obtained by
adding a single element to each subsystem. In the case of Example 1, Fig.
7.4(b) shows that the minimum system design requires n_1 = 2, n_2 = 4, and n_3 =
7, yielding a reliability R_sm = 0.9775 x 0.9375 x 0.9176 = 0.8409. The addition
of one element to n_1 raises the reliability of this subsystem to 0.996625 and R_sm
to 0.8573, which represents a reliability increment ΔR of 0.0164. Repeating
this process for an increase in n_2 from 4 to 5 results in raising subsystem 2
to a reliability of 0.96875 and R_sm to 0.8689, which represents a reliability
increment ΔR of 0.0280. Increasing n_3 to 8 yields a subsystem reliability of
0.9424, and R_sm increases to 0.8636, which represents a reliability increment
ΔR of 0.0227. Raising subsystem 2 from 4 to 5 parallel elements produces the
largest ΔR; thus the first stage of the greedy algorithm yields the following:

1. Stage 0, minimum system design:

   n_1 = 2, n_2 = 4, n_3 = 7,
   R_sm = 0.9775 x 0.9375 x 0.9176 = 0.8409

2. Stage 1, add one to n_2:

   n_1 = 2, n_2 = 5, n_3 = 7,
   R_sm = 0.9775 x 0.96875 x 0.9176 = 0.8689

Continuing the greedy process yields the following:

1. Stage 2, add one to n_3:

   n_1 = 2, n_2 = 5, n_3 = 8,
   R_sm = 0.9775 x 0.96875 x 0.9424 = 0.8924

2. Stage 3, add one to n_1:

   n_1 = 3, n_2 = 5, n_3 = 8,
   R_sm = 0.996625 x 0.96875 x 0.9424 = 0.909868

When we compare the solution of stage 3 with Table 7.5, we see that they
both have reached the same policy and the same optimum reliability (within
round-off errors). The greedy algorithm always yields a good solution, but it
is not always the optimum (cf. Section 7.11.4).
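The stages above can be reproduced with a short sketch of the greedy procedure for equal (unit) costs. The function names are hypothetical, and the code is only an illustration under the Example 1 data, not a general-purpose optimizer.

```python
from math import prod

R = (0.85, 0.5, 0.3)     # subsystem reliabilities, Example 1
START = [2, 4, 7]        # minimum system design
EXTRA_UNITS = 3          # augmentation budget, one cost unit per element

def reliability(n):
    return prod(1.0 - (1.0 - r) ** k for r, k in zip(R, n))

def greedy(start, extra_units):
    """Add one element at a time to the subsystem giving the largest gain."""
    n = list(start)
    history = [(tuple(n), reliability(n))]
    for _ in range(extra_units):
        base = reliability(n)
        gains = []
        for i in range(len(n)):
            trial = n.copy()
            trial[i] += 1
            gains.append(reliability(trial) - base)
        best = max(range(len(n)), key=lambda i: gains[i])
        n[best] += 1
        history.append((tuple(n), reliability(n)))
    return history

history = greedy(START, EXTRA_UNITS)
for stage, (n, rs) in enumerate(history):
    print(stage, n, round(rs, 4))
```

The sketch visits (2, 4, 7), (2, 5, 7), (2, 5, 8), and (3, 5, 8) in that order, the same sequence as stages 0 through 3.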

7.11.3 Unequal Weights and Multiple Constraints 

There is something special about Example 1: all the weights are
equal. If we consider Example 2, where there are unequal weights, it may not
be fair to compare the reliability increase, ΔR, achieved through the adding of
one additional parallel component. For example, adding a component with a
cost of 2 should be compared to adding two components with costs of 1 each.
Thus, a better procedure in implementing the greedy algorithm is to compare
values of ΔR/ΔC when cost is the single constraint. When there are multiple
constraints, say, c, w, and v, then the comparison should be made based on some
function of c, w, and v, f(c, w, v); that is, use ΔR/Δf(c, w, v) as the comparison
factor. One possible function to use is a linear combination of the fraction of the
constraints expended. If m represents the stage of the augmentation policy, then
we would view the ratio C_m/C_a as the fraction of the augmentation cost that has
been allocated. Thus, if at the first stage we allocate 20% of the augmentation
cost, then C_m/C_a = 0.2 and the inverse ratio C_a/C_m = 5. If we let f(c, w, v) =
k_1(C_m/C_a) + k_2(W_m/W_a) + k_3(V_m/V_a), and also let k_1 = k_2 = k_3 = 1, then the
constraint with the most available capacity has a stronger influence. Obviously,
there are many other good choices for f(c, w, v).
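For the single-constraint case, the ΔR/ΔC comparison can be sketched with the Example 2 data (reliabilities 0.8, 0.8, 0.8, 0.9, 0.9; costs 2, 2, 2, 3, 3; budget 36). The function name and the stopping rule (stop when no further element fits the budget) are assumptions made for this illustration.

```python
from math import prod

R = (0.8, 0.8, 0.8, 0.9, 0.9)   # subsystem reliabilities, Example 2
COST = (2, 2, 2, 3, 3)          # cost per element
START = [2, 2, 2, 2, 2]         # minimum system design (24 cost units)
BUDGET = 36

def reliability(n):
    return prod(1.0 - (1.0 - r) ** k for r, k in zip(R, n))

def greedy_by_ratio(start, budget):
    """Repeatedly buy the affordable element with the largest gain per cost unit."""
    n = list(start)
    spent = sum(c * k for c, k in zip(COST, n))
    while True:
        base = reliability(n)
        best_i, best_ratio = None, 0.0
        for i, c in enumerate(COST):
            if spent + c > budget:
                continue                      # this element no longer fits
            trial = n.copy()
            trial[i] += 1
            ratio = (reliability(trial) - base) / c
            if ratio > best_ratio:
                best_i, best_ratio = i, ratio
        if best_i is None:
            return n, spent
        n[best_i] += 1
        spent += COST[best_i]

policy, spent = greedy_by_ratio(START, BUDGET)
print(policy, spent, round(reliability(policy), 5))
```

For these data the ratio rule ends at (4, 4, 4, 2, 2) with all 36 units spent, which is policy 1 of Table 7.8; as Section 7.11.4 explains, such agreement with the optimum is not guaranteed in general.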

7.11.4 When Is the Greedy Algorithm an Optimum?

The greedy algorithm seems like a fine approach to generating a sequence of
good policies that lead to an optimum or a good semioptimum policy. The main
question is when the greedy solution is an optimum and when it is a
semioptimum, a question that has been studied by many [Barlow, 1965, 1975]. The
geometrical model discussed in Section 7.2 can be used to explain optimization
techniques. Suppose that the reliability surface has two spires: one at (x_1, y_1)
that reaches a reliability peak of 0.98 and another at (x_2, y_2) that reaches a
reliability peak of 0.99. If we start the greedy algorithm at the base of spire
one, it is possible that it will reach a reliability maximum of 0.98 rather than
0.99. There are similarities between the greedy algorithm and the gradient
algorithm for continuous functions [Hiller, 1974, p. 729]. Recent work has focused
on developing a theory (called the matroid theory) that provides the basis for
greedy algorithms [Cormen, p. 345]. However, as long as the upper and lower bounds
discussed previously provide a family of solutions that includes the optimum
as well as a group of good suboptimum policies, and if the computer time is
modest, the use of the greedy algorithm is probably unnecessary.

7.11.5 Greedy Algorithm Versus Apportionment Techniques 

We can understand how the apportionment algorithm reaches an approximate
solution if we compare it with a greedy approximation to exact optimization.
The greedy approximation adds a new redundant component at each step,
which yields the highest gain in reliability on each iteration. The result is
to “spread the redundancy around the system” in a bottom-up fashion. The
apportionment process also spreads the redundancy about the system, but in a
top-down fashion. In general, the two techniques will yield a different set of
suboptima; most of the time, both will yield good solutions. The apportionment
approach has a number of advantages, including the following: (a) it fits
in well with the philosophy of system design, which is generally adopted by
designers of large systems; (b) its computations will in general be simpler; and
(c) it may provide more insight into when the suboptimal solutions it generates
are good.


7.12 DYNAMIC PROGRAMMING 
7.12.1 Introduction 

Dynamic programming provides a different approach to optimization that
requires fewer steps than exhaustive enumeration and always leads to an optimal
policy. The discussion of this section is included for completeness, since
the author believes that the application of lower bounds to obtain a minimum
system design and the subsequent use of upper bounds to obtain an augmentation
policy both require less effort but still yield the optimum. The incremental
reliability method is a competitor to the bounding techniques, but unless one
makes a careful study, it is not possible to be sure that this method does indeed
yield an optimum.

Dynamic programming is based on an optimality principle established by
Bellman [1957, 1958], who states, “An optimal policy has the property that
whatever the initial state and initial decision are, the remaining decisions must
constitute an optimal policy with regard to the state resulting from the first
decision.” Clearly, this is a high-level principle that can apply to a large variety
of situations. A large number of examples that describe how dynamic programming
can be applied to various situations appear in Hiller [1974, Chapter
6]. The best way to understand its application to reliability optimization is to
apply it to a problem.


7.12.2 Dynamic Programming Example 

The following example used to illustrate dynamic programming is a modification
of Example 1 of Section 7.7.2.

Example 3 (Modification of Example 1): The initial design of a system yields
3 subsystems at the first level of decomposition. The system reliability goal,
R_g, is 0.8 for a given number of operating hours. The initial estimates of the
subsystem reliabilities are R_1 = 0.85, R_2 = 0.5, and R_3 = 0.3. Parallel redundancy
is to be used to improve the initial design so that it meets the reliability goal.
The constraint is cost; subsystem 1 is assumed to cost 2 units and subsystems
2 and 3 to cost 1 unit each. The total cost budget is 16 units.


7.12.3 Minimum System Design 

Dynamic programming can deal with the optimization problem as stated.
However, one should take advantage of the minimum system design procedures
(lower bounds) to reduce the size of the problem. Thus the minimum design
is computed and dynamic programming is used to solve for the augmentation
policy. The minimum system design is computed in a manner similar to that
of Eqs. (7.35a-d).

    n_1 = \log(1 - 0.8) / \log(1 - 0.85) = 0.848             (7.41a)

    n_2 = \log(1 - 0.8) / \log(1 - 0.5) = 2.322              (7.41b)

    n_3 = \log(1 - 0.8) / \log(1 - 0.3) = 4.512              (7.41c)

Rounding up, the minimum system design consists of one subsystem 1, three
subsystem 2, and five subsystem 3 elements. The cost of the minimum design is
C = 1 x 2 + 3 x 1 + 5 x 1 = 10, and the cost available for the augmentation
policy is ΔC = 16 - 10 = 6. The reliability of the minimum system design is

    R_{sm} = [1 - (1 - 0.85)] \times [1 - (1 - 0.5)^3] \times [1 - (1 - 0.3)^5]
           = (0.85) \times (0.875) \times (0.83193) = 0.6187                 (7.42)


7.12.4 Use of Dynamic Programming to Compute the Augmentation 
Policy 

We now wish to use dynamic programming to determine the best augmentation
policy for raising the reliability from 0.6187 to 0.8 by
expending the remaining six cost units. Dynamic programming for this problem
consists of two phases: I and II. Phase I is used to construct a series of
tables that correspond to the best solution for cost allocation for various
combinations of the subsystems. The first table considers the first subsystem only.
The second table corresponds to the best cost allocation for the first and second
subsystem; its construction uses information from the first table. A third table
is constructed that gives the best allocation for the third system based on the
second table, which depicts the best allocation for the first and second
subsystems. For phase II, certain features of the three tables of phase I are combined
to construct a new table that displays cost allocation per subsystem and also
the resulting reliability optimization. A “backtracking” solution procedure is
used to compute the optimal policy for the cost constraint. The solution
procedure automatically allows backtracking to compute optimization solutions for
smaller cost constraints. The description of these procedures will be clearer
when they are applied to Example 3.
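The two phases can be sketched compactly in Python for Example 3. This is an illustrative sketch with hypothetical names, not the tabular procedure itself: each stage table here stores the product of the reliabilities of the subsystems considered so far, rather than the full system reliability, so only the last table gives system values directly.

```python
R = (0.85, 0.5, 0.3)   # subsystem reliabilities, Example 3
COST = (2, 1, 1)       # cost per element
M = (1, 3, 5)          # minimum system design
EXTRA = 6              # cost units left for augmentation

def unit_rel(r, n):
    """Reliability of n parallel units, each with reliability r."""
    return 1.0 - (1.0 - r) ** n

# Phase I: best[k][c] = (reliability, extra units of subsystem k) for the
# first k + 1 subsystems when c cost units may be spent on them.
best = []
for k, (r, cost, m) in enumerate(zip(R, COST, M)):
    table = []
    for c in range(EXTRA + 1):
        options = []
        for x in range(c // cost + 1):
            left = c - x * cost
            prev = best[k - 1][left][0] if k > 0 else 1.0
            options.append((prev * unit_rel(r, m + x), x))
        table.append(max(options))
    best.append(table)

# Phase II: backtrack from the last table to recover the allocation.
def backtrack(c):
    alloc = [0] * len(R)
    for k in reversed(range(len(R))):
        x = best[k][c][1]
        alloc[k] = M[k] + x
        c -= x * COST[k]
    return tuple(alloc)

policy = backtrack(EXTRA)
print(policy, round(best[-1][EXTRA][0], 4))
```

With all six cost units available, backtracking recovers the design (2, 5, 7) with a reliability of about 0.8689; calling backtrack with a smaller argument gives the best design for a smaller augmentation budget.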

We begin our discussion by constructing the first table of phase I of the
solution, which is found in Table 7.9 and labeled as “Table 1.” The first
column in Table 1 is the amount of cost constraint allocated to subsystem 1. The
maximum allocation is 6 cost units; the minimum, 0 cost units. The increments
between 0 and 6 are sized to be equal to the greatest common divisor of all the
subsystem costs, gcd(C_1, C_2, C_3). In the case of Example 3, this is
gcd(2, 1, 1) = 1. Thus the first column in Table 1 comprises the cost
allocations 0, 1, 2, 3, 4, 5, and 6.

The details of constructing Table 1 are as follows:

1. Consider the bottom line of Table 1. This table considers only the
allocation of cost to buy additional parallel units for subsystem 1. If 0 cost is
allocated to subsystem 1 for the augmentation policy, then no additional
components can be allocated above the single subsystem of the minimal
system design, and the optimal reliability is the same as the minimum
system design, that is, 0.6187.

2. Because subsystem 1 costs 2 units, no additional units can be purchased
with 1 cost unit, and the solution is the same as the 0 cost allocation.

3. If we increase the cost allocation to 2 units, we can allocate 1 additional
unit to subsystem 1 for a total cost of 2, from which the reliability
becomes R_s = [1 - (1 - 0.85)^2] x (0.875) x (0.83193) = (0.9775) x
(0.875) x (0.83193) = 0.7116. This solution also holds for 3 cost units.

4. For 4 and 5 cost units, there can be an allocation of two additional
subsystem 1 units, from which the reliability becomes R_s = [1 - (1 - 0.85)^3]
x (0.875) x (0.83193) = (0.996625) x (0.875) x (0.83193) = 0.7255.

5. Lastly, for an allocation of 6 units, the total number of subsystem 1 units
is 1 + 3, from which the reliability becomes R_s = [1 - (1 - 0.85)^4] x
(0.875) x (0.83193) = (0.999493) x (0.875) x (0.83193) = 0.7276.

TABLE 7.9 Phase I Constraint Allocation Tables

Table 1: Allocation Table for Subsystem 1

   ΔCost Constraint    Δn_1 Allocation    Optimum Reliability
          6                   3                 0.7276
          5                   2                 0.7255
          4                   2                 0.7255
          3                   1                 0.7116
          2                   1                 0.7116
          1                   0                 0.6187
          0                   0                 0.6187

Table 2: Allocation Table for Subsystems 1 and 2

   ΔCost Constraint    Δn_2 Allocation    Optimum Reliability
          6                   4                 0.8068
          5                   3                 0.8063
          4                   2                 0.7878
          3                   1                 0.7624
          2                   0                 0.7116
          1                   1                 0.6629
          0                   0                 0.6187

Table 3: Allocation Table for Subsystems 1, 2, and 3

   ΔCost Constraint    Δn_3 Allocation    Optimum Reliability
          6                   2                 0.8689
          5                   2                 0.8409
          4                   1                 0.8086
          3                   0                 0.7624
          2                   0                 0.7116
          1                   0                 0.6629
          0                   0                 0.6187


The details of constructing Table 2 are as follows: 
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1. Consider the bottom line of Table 2. If there are 0 cost units allocated, 
then there can be no additional parallel elements for subsystem 2 and 
none for subsystem 1. Thus the reliability is the same as the bottom line 
in Table 1—that is, 0.6187. 

2. If there is 1 cost unit allocated in Table 2, we can allocate 1 additional 
unit to subsystem 2 or we can consult Table 1 to see the result of allocat¬ 
ing 1 cost unit to subsystem 1 instead. Since subsystem 1 is 2 cost units, 
there is no gain obtained by the allocation of 1 cost unit to subsystem 1. 
Therefore, the optimum is to allocate 1 additional element to subsystem 
2 (for a total of 4), from which the reliability is R s = (0.85) x [ I - ( | 
- 0.5) 4 ] x (0.83193) x (0.85) x (0.9375) x (0.83193) = 0.6629. Note 
that this is actually the optimum for subsystems 1 and 2, which is the 
meaning of Table 2. 

3. For a cost allocation of 2, we have three choices: (a) 1 additional element 
for Am and 0 for An \; (b) two additional elements for Am and 0 for An 1 ; 
and (c) 0 additional elements for Am and 1 for Am . Clearly, choice (b) 
is superior to choice (a), so for the optimum policy we need to compare 
choices (b) and (c). Note that for choice (c), we can obtain the achieved 
reliability by reading the appropriate row in Table 1, which indicates R s 
= 0.7116. For choice (2), we obtain R s = (0.85) x [1 - (1 - 0.5) 5 ] x 
(0.83193) = (0.85) x (0.96875) x (0.83193) = 0.6850. Thus choice (b) 
is superior, and we allocate 0 elements to Am- 

4. In the case of a cost constraint of 3, there are again three choices: (a) 

1 additional element for A 112 and 1 for An \; (b) 2 additional elements 
for An 2 and 0 for An j; and (c) 3 additional elements for Any and 0 for 
Ahi. Clearly, choice (c) is superior to choice (b). To compare choice (c) 
with choice (a), we say that it is a comparison of 1 additional element 
fosr h and 1 for m versus 1 additional element for 112 along with 2 more 
additional elements for m- However, we already showed that 1 for Hi 
is better that 2 for H 2 ; therefore, choice (a) is superior, from which the 
reliability is R s = [1 - (1 - 0.85) 2 ] x [1 - (1 - 0.5) 4 ] x [1 - (1 - 0.3) 5 
= (0.9775) x (0.9375) x (0.83193) = 0.7624. 

5. For the case of 4 units of incremental cost, there are again three choices: 
(a) 0 additional elements for AH 2 and 2 for Am; (b) 2 additional elements 
for Ah 2 and 1 for Am; and (c) 4 additional elements for Am and 0 for 
Am. From Table 1, we see that choice (a) yields a reliability of 0.7255, 
which is smaller than the previous allocation in Table 2; thus we should 
consider choice (b) or choice (c). The reliability for these two choices is 
given by R s = [1 - (1 - 0.85) 2 ] x [1 * (1 - 0.5) 5 ] x [1 - (1 - 0.3) 5 ] = 
(0.9775) x (0.96875) x (0.83193) = 0.7878, and R s =[1 - (1 — 0.85) *] 
x [1 - (1 - 0.5) 6 ] x [1 _ (1 - 0.3) 5 ] =(0.85) x (0.9844) x (0.83193) 
= 0.6961]. Thus choice (b) is superior. 

6. For the case of 5 cost units, one choice is 2 units for Hi and 1 for m, 
which gives a reliability of R s = [1 - (1 - 0.85) 3 ] x [1 - (1 - 0.5) 5 ] x [1 



376 


RELIABILITY OPTIMIZATION 


- (1 - 0.3) 5 ] =(0.996625) x (0.96875) x (0.83193) = 0.8032. Another 
choice is 1 unit of m and 3 units of n 2 , with a reliability of R s = [1 — (1 

- 0.85) 2 ] x [1 - (1 - 0.5) 6 ] x [1 - (1 - 0.3) 5 ] = (0.9775) x (0.9844) 

x (0.83193) = 0.8063. The remaining choice is 0 units of n i and 5 units 
of n 2 , with a reliability of R s = [1 - (1 - 0.85) 1 ] x [1 (1 0.5) 8 ] x 

[1 - (1 - 0.3) 5 ] = (0.85) x (0.9961) x (0.83193) = 0.7044. The second 
choice is the best. 

7. Lastly, a cost increment of 6 units allows a policy of 3 units for n\ and 
0 units for n 2 , which gives a reliability of R s = [1 — (1 - 0.85) 4 ] x [1 

- (1 - 0.5) 3 ] x [l = (l - 0.3) 5 ] = (0.99949) x (0.875) x (0.83193) = 
0.72757, which is not an improvement from the previous policy. Another 
choice is 2 units for >i\ and 2 units for n 2 , which gives a reliability of R s 
= [1 - (1 - 0.85) 3 ] x [l - (1 - 0.5) 5 ] x [1 - (1 _ 0.3) 5 ] =(0.996625) 
x (0.96875) x (0.83193) = 0.8032; because this is less than the previous 
policy, it is not optimum. The remaining choice is 1 unit for n.\ and 4 
units for n 2 , which gives a reliability of R s = [1 - (1 - 0.85) 2 ] x [1 - 
(1 - 0.5) 7 ] x [l - (l - o.3) 5 ] = (0.9775) x (0.99219) x (0.83193) = 
0.80686. 
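
Every reliability value in the table-construction steps above is an instance of
the same series-parallel product, $R_s = \prod_i [1 - (1 - r_i)^{n_i}]$. The
following is a minimal Python sketch of that product for Example 3; the
function and constant names are illustrative assumptions, and the unit
reliabilities (0.85, 0.5, 0.3) are the ones that appear in the worked
equations.

```python
from math import prod

# Unit reliabilities of subsystems 1, 2, and 3 as used in the worked equations.
UNIT_RELIABILITY = (0.85, 0.5, 0.3)

def system_reliability(design, unit_reliability=UNIT_RELIABILITY):
    """Series system of parallel groups: product of 1 - (1 - r_i)**n_i.

    `design` holds the number of parallel units (n_1, n_2, n_3).
    """
    return prod(1.0 - (1.0 - r) ** n for r, n in zip(unit_reliability, design))

if __name__ == "__main__":
    # Minimum design, then a few of the augmented designs discussed above.
    for design in [(1, 3, 5), (2, 3, 5), (3, 3, 5), (2, 4, 5), (2, 7, 5)]:
        print(design, round(system_reliability(design), 4))
```

Evaluating a candidate policy this way is a quick check on any single row of
the phase I tables before it is entered.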

The details of constructing Table 3 are as follows:

1. If there is 0 cost allocation for the incremental policy, the minimum
system prevails, as is shown in the last row of Table 3. If 1 unit of cost is
available, it could be added to subsystem 3 to give a reliability of $R_s = [1
- (1 - 0.85)^1] \times [1 - (1 - 0.5)^3] \times [1 - (1 - 0.3)^6] = (0.85)
\times (0.875) \times (0.8824) = 0.6562$. Inspection of Table 2 shows that this
is inferior to allocating the single cost unit to subsystems 1 and 2. Thus the
entry from Table 2 is inserted in Table 3.

2. If 2 cost units are available, they can be used for two additional
subsystem 3 units to give a reliability of $R_s = [1 - (1 - 0.85)^1] \times [1
- (1 - 0.5)^3] \times [1 - (1 - 0.3)^7] = (0.85) \times (0.875) \times (0.9176)
= 0.6825$. Another choice is 1 unit for subsystem 3 and the optimum 1-unit cost
from Table 2, which is one additional subsystem 2 that gives a reliability of
$R_s = [1 - (1 - 0.85)^1] \times [1 - (1 - 0.5)^4] \times [1 - (1 - 0.3)^6] =
(0.85) \times (0.9375) \times (0.8824) = 0.7032$. The last possible choice is 0
cost for subsystem 3. Table 2 shows that all the cost is then allocated to
subsystem 1, which achieves a reliability of 0.7116. This solution is entered
in Table 3.

3. For the cost increment of 3 units, one choice is to allocate all of this to
subsystem 3 to give a reliability of $R_s = [1 - (1 - 0.85)^1] \times [1 - (1 -
0.5)^3] \times [1 - (1 - 0.3)^8] = (0.85) \times (0.875) \times (0.9424) =
0.70091$. If 1 cost unit is allocated to subsystem 3, then Table 2 shows that
we should allocate the remaining 2 cost units to purchase an additional
subsystem 1, from which the reliability becomes $R_s = [1 - (1 - 0.85)^2]
\times [1 - (1 - 0.5)^3] \times [1 - (1 - 0.3)^6] = (0.9775) \times (0.875)
\times (0.8824) = 0.7547$. The remaining choice is to allocate 0 cost to
subsystem 3 and use the solution for 3 cost units from Table 2, which uses one
additional subsystem 1 and one additional subsystem 2 and gives the best
reliability of 0.7624.

4. For the case of 4 cost units, all can be allocated to subsystem 3 to give a
reliability of $R_s = [1 - (1 - 0.85)^1] \times [1 - (1 - 0.5)^3] \times [1 -
(1 - 0.3)^9] = (0.85) \times (0.875) \times (0.9596) = 0.7137$. Allocating 3
cost units to subsystem 3 plus the remaining unit to subsystem 2 gives a
reliability of $R_s = [1 - (1 - 0.85)^1] \times [1 - (1 - 0.5)^4] \times [1 -
(1 - 0.3)^8] = (0.85) \times (0.9375) \times (0.9424) = 0.7509$. By allocating
2 cost units to subsystem 3, Table 2 reveals that the remaining 2 cost units
should be allocated to subsystem 1 to give a reliability of $R_s = [1 - (1 -
0.85)^2] \times [1 - (1 - 0.5)^3] \times [1 - (1 - 0.3)^7] = (0.9775) \times
(0.875) \times (0.9176) = 0.7848$. Allocating 1 cost unit to subsystem 3 and,
from Table 2, 1 additional unit each for subsystems 1 and 2 gives a reliability
of $R_s = [1 - (1 - 0.85)^2] \times [1 - (1 - 0.5)^4] \times [1 - (1 - 0.3)^6]
= (0.9775) \times (0.9375) \times (0.8824) = 0.8086$. Allocating 0 cost units
to subsystem 3 and, from Table 2, 1 additional unit for subsystem 1 as well as
2 additional units for subsystem 2 gives a reliability of $R_s = [1 - (1 -
0.85)^2] \times [1 - (1 - 0.5)^5] \times [1 - (1 - 0.3)^5] = (0.9775) \times
(0.96875) \times (0.8319) = 0.7878$. The allocation of 1 cost unit to subsystem
3 is therefore the best.

5. Similar computations yield the allocations for the 5 and 6 cost units
shown in Table 3.
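
The stage-by-stage construction above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: the names `unit_factor`
and `phase_one` are hypothetical, and the data (unit reliabilities 0.85, 0.5,
0.3; unit costs 2, 1, 1; minimum design 1, 3, 5) are those used in the worked
steps. Each stage tries every affordable number of extra units for the current
subsystem and combines it with the best result already tabulated for the
remaining budget.

```python
def unit_factor(r, n):
    """Reliability of n identical units of reliability r in parallel."""
    return 1.0 - (1.0 - r) ** n

def phase_one(r, n_min, cost, budget):
    """Build one allocation table per subsystem.

    tables[k][b] = (extra units of subsystem k, system reliability) for an
    incremental budget b, with later subsystems held at their minimum design.
    """
    tables = []
    prev = [1.0] * (budget + 1)  # best product over the stages handled so far
    for k, (rk, nk, ck) in enumerate(zip(r, n_min, cost)):
        rest = 1.0  # contribution of the subsystems not yet considered
        for rj, nj in zip(r[k + 1:], n_min[k + 1:]):
            rest *= unit_factor(rj, nj)
        stage, table = [], []
        for b in range(budget + 1):
            value, extra = max(
                (unit_factor(rk, nk + x) * prev[b - x * ck], x)
                for x in range(b // ck + 1)
            )
            stage.append(value)
            table.append((extra, value * rest))
        tables.append(table)
        prev = stage
    return tables

if __name__ == "__main__":
    tables = phase_one((0.85, 0.5, 0.3), (1, 3, 5), (2, 1, 1), 6)
    for b in range(6, -1, -1):
        extra, rel = tables[2][b]
        print(b, extra, round(rel, 4))
```

The last table plays the role of Table 3: its reliability column is the optimum
for each incremental budget from 0 to 6.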


We now describe phase II of dynamic programming: the backtracking procedure.
This procedure is merely a reorganization of the information contained in the
phase I tables so that an optimum policy can be easily chosen. In a “shorthand”
way, the cost allocated to each subsystem defines the policy because dividing
the cost by the cost per element yields the number of elements.

The optimum policy for a cost constraint of 6 units is found by starting at
the point in the optimum reliability column of Table 7.10 that corresponds to
a cost constraint of 6 ($c = 6$). The optimum reliability is 0.8689; to the
immediate left, we see that for this policy, 2 cost units ($\Delta n_3 = 2$)
have been allocated to ss3 (subsystem 3), leaving 4 units available. If we
look in the allocation to ss2 for a 4-unit cost constraint, we see that 2 cost
units are used; thus $\Delta n_2 = 2$. This leaves 2 cost units for the first
subsystem, which means $\Delta n_1 = 1$. The augmentation policy is therefore
(1, 2, 2); when added to the minimum system design (1, 3, 5), it yields the
optimal policy (2, 5, 7). The circles and lines in Table 7.10 connect the
backtracking steps. A feature of the dynamic programming solution is that it
gives the optimal solution for all constraint values below the maximum. For
example, suppose that we wanted the solution for 4 cost units. By
backtracking, we have 1 unit for $n_3$, 1 unit for $n_2$, and 1 unit for
$n_1$. The policy, together with the minimal system design, is (2, 4, 6),
which achieves a reliability of 0.8086. For additional examples of dynamic
programming, see Messinger [1970].

TABLE 7.10 Phase II: Backtracking Table for Reliability Augmentation

Cost          Allocation to    Allocation to    Allocation to    Optimum
Constraint    ss1 Cost         ss2 Cost         ss3 Cost         Reliability
    6              6                4               (2)            0.8689
    5              4                3                2             0.8409
    4              4               (2)              (1)            0.8086
    3              2               (1)               0             0.7624
    2             (2)               0                0             0.7116
    1              0                1                0             0.6629
    0              0                0                0             0.6187

Note: Entries in parentheses are the circled backtracking steps. The solid
line (policy for a cost constraint of 6) joins the ss3 entry in row 6, the ss2
entry in row 4, and the ss1 entry in row 2; the dashed line (policy for a cost
constraint of 4) joins the ss3 entry in row 4, the ss2 entry in row 3, and the
ss1 entry in row 2.
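
The backtracking walk just described is mechanical enough to sketch in code.
The fragment below is a hedged illustration, not the book's program: the list
names and the function `backtrack` are assumptions, while the allocation
columns are copied from the three phase I tables of Table 7.9 (index =
incremental cost constraint, value = additional units).

```python
# Additional-unit columns of Table 7.9, indexed by incremental cost 0..6.
DELTA_N1 = [0, 0, 1, 1, 2, 2, 3]
DELTA_N2 = [0, 1, 0, 1, 2, 3, 4]
DELTA_N3 = [0, 0, 0, 0, 1, 2, 2]
UNIT_COST = [2, 1, 1]   # cost per element of subsystems 1, 2, 3
MIN_DESIGN = [1, 3, 5]  # minimum system design

def backtrack(budget):
    """Return the augmentation policy [dn1, dn2, dn3] for a cost constraint."""
    policy = []
    stages = list(zip([DELTA_N1, DELTA_N2, DELTA_N3], UNIT_COST))
    for table, cost in reversed(stages):  # start at subsystem 3, work back
        extra = table[budget]
        policy.append(extra)
        budget -= extra * cost            # cost left for earlier subsystems
    policy.reverse()
    return policy

def optimal_design(budget):
    """Minimum design plus the augmentation policy."""
    return [n + dn for n, dn in zip(MIN_DESIGN, backtrack(budget))]

if __name__ == "__main__":
    for c in (6, 5, 4):
        print(c, backtrack(c), optimal_design(c))
```

Because the tables already hold the best choice for every smaller budget, the
same walk answers any cost constraint from 0 to 6 without recomputation.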


7.12.5 Use of a Bounded Approach to Check Dynamic Programming Solution

To check the results of the dynamic programming solution to Example 3, a
slightly revised version of the algorithm in Table 7.4 was written for Example
3, and the associated program was run for reliability values that exceed 0.8.
The program generated 14 solutions with costs of 14, 15, and 16 units. The
optimum reliabilities for each of these cost constraints are given in Table
7.11. The results given in Table 7.11 are, of course, identical with those
obtained by backtracking in Table 7.10.
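
An independent check of this kind can also be made by plain exhaustive
enumeration. The sketch below is not the bounded algorithm of Table 7.4; it
simply tries every design that fits the total cost constraint and keeps the
most reliable one. The names are illustrative assumptions, and the function
assumes the constraint is at least the cost of the minimum design (10 units).

```python
from itertools import product
from math import prod

R = (0.85, 0.5, 0.3)   # unit reliabilities of subsystems 1, 2, 3
COST = (2, 1, 1)       # cost per unit
N_MIN = (1, 3, 5)      # minimum system design

def reliability(design):
    """Series system of parallel groups."""
    return prod(1.0 - (1.0 - r) ** n for r, n in zip(R, design))

def total_cost(design):
    return sum(c * n for c, n in zip(COST, design))

def best_design(max_cost):
    """Most reliable design with total cost <= max_cost (max_cost >= 10)."""
    spare = max_cost - total_cost(N_MIN)
    ranges = [range(n, n + spare // c + 1) for n, c in zip(N_MIN, COST)]
    feasible = (d for d in product(*ranges) if total_cost(d) <= max_cost)
    return max(feasible, key=reliability)

if __name__ == "__main__":
    for c in (16, 15, 14):
        d = best_design(c)
        print(d, total_cost(d), round(reliability(d), 7))
```

For a problem this small the enumeration is instantaneous; for larger systems
the bounding of Table 7.4 or the dynamic programming tables would be the
practical choice.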


TABLE 7.11 Computation of Optimum Reliability for Example 3
Using the Minimum System Design and Augmentation Policy

Minimum System Design

$n_1$    $n_2$    $n_3$
  1        3        5

Augmentation Policy

$n_1$    $n_2$    $n_3$     $C$     $R_s$
  2        5        7       16      0.8689674
  2        4        7       15      0.8409362
  2        4        6       14      0.808592

7.13 CONCLUSION

The methods of this chapter provide a means of implementing component
reliability at the lowest level possible in most systems: the line-replaceable unit
(LRU) level. However, in some cases the designer may not wish to implement 
strict component redundancy. For example, if a major portion of a system is 
already available because it was used in a previous design, then it may be most 
cost-effective to use this as a fixed subsystem. If the reliability is too low, we 
merely place additional copies of this subsystem in parallel rather than delve 
within the design to provide a lower-level redundancy. A similar case occurs 
when portions of a design are being implemented by using existing very high 
level integrated circuits. 

Optimizing a design is a difficult problem for many reasons. Designers often
rush to meet schedule and cost targets and look for feasible solutions that
meet the performance requirements; thus reliability may be treated as an
afterthought. This approach seldom leads to even a good suboptimal design,
much less a design with optimum reliability. The methods outlined in this
chapter provide the designer with many tools to rapidly generate a family of
good optimum and suboptimum system designs. This provides guidance when
choices must be made rapidly and conflicting design constraints must be
satisfied.

PROBLEMS

7.1. Why was Eq. (7.3) written in terms of probability of failure rather than 
probability of success? 

7.2. There have been many studies of software errors that relate the number 
of errors to the number of interfaces. Suppose that many of the hardware 
design errors are also related to the number of interfaces. What can you 
say about the complexity of the design and the number of errors based 
on the results of Sections 7.5.2 and 7.5.3? 

7.3. Repeat the apportionment example of Section 7.6.1 for reliability goals 
of 0.90 and 0.99. 

7.4. Repeat the apportionment example of Section 7.6.2 for reliability goals 
of 0.90 and 0.99. 

7.5. Repeat the apportionment example of Section 7.6.3 for reliability goals 
of 0.90 and 0.99. 

7.6. Repeat the apportionment example of Section 7.6.4 for reliability goals 
of 0.90 and 0.99. 

7.7. Comment on the results of problems 7.3-7.6 with respect to the difficulty 
of the computations, how close the results agree, and which results you 
think are the most realistic. 

7.8. Derive Eqs. (7.28a, b) by formulating a Markov model and solving the 
associated equations. 

7.9. Suppose the reliability goal for Example 1 of Section 7.7.2 is 0.95.
Compute the minimum system design. Repeat for a reliability goal of 0.99.

7.10. Repeat problem 7.9 for Example 2.

7.11. Write a computer program corresponding to the algorithm of Table 7.4 
and verify the results of Tables 7.5 and 7.6. 

7.12. Change the program of problem 7.11 so that it prints out the results in 
descending order of reliability. 

7.13. Repeat problem 7.11 for the algorithm of Table 7.7. 

7.14. Repeat problem 7.12 for Example 2. 

7.15. Rewrite the algorithm of Table 7.4 to include volume and weight
constraints as discussed in Section 7.8.5.

7.16. Repeat problem 7.15 for the algorithm of Table 7.7. 

7.17. Compare the results of apportionment with the optimum system design 
for Example 1 where the reliability goals are 0.95 and 0.99, as was done 
in Section 7.9. 

7.18. Compare the results of apportionment with the optimum system design 
for Example 2 where the reliability goals are 0.95 and 0.99, as was done 
in Section 7.9. 

7.19. Repeat problem 7.9 for Example 3 given in Section 7.12.2, with
reliability goals of 0.85, 0.90, and 0.95.

7.20. Write a computer program to solve for the minimum system design and 
the augmentation policy of Example 3 of Section 7.12.2. 

7.21. Modify the algorithm of Table 7.4 for the case of standby systems as 
discussed in Section 7.10. 

7.22. Repeat problem 7.21 for Example 2. 

7.23. Repeat problem 7.21 for Example 3. 

7.24. Write a general program for bounded optimization that includes the
following features: (a) the input of the number of series subsystems and
their reliability, cost, weight, and volume; and (b) the input of the system
reliability goals and the cost, weight, and volume constraints. Then,
specify which subsystems will use parallel and which will use standby
redundancy. Policies are to be printed in descending reliability order.

7.25. Write a program to solve the greedy algorithm of Section 7.11.2. 

7.26. Use the greedy algorithm of Section 7.11.2 to solve for the optimum for 
Example 1. 

7.27. Repeat problem 7.26 for Example 2. 

7.28. Repeat problem 7.26 for Example 3. 





7.29. Repeat problem 7.26 for the multiple constraints discussed in Section 
7.11.3. 

7.30. Write a program to solve for the dynamic programming algorithm of 
Section 7.12 and verify Tables 7.9 and 7.10. 

7.31. A satellite communication system design is discussed in the article by
Mancino [1986]. The structure is essentially a series system with many
components paralleled to increase the reliability. Practical failure rates
are given for the system. Use the article to redesign the system by
incorporating optimum design principles, making any reasonable assumptions.
How does your design compare with that in the article? How was the design in
the article achieved?

7.32. Starting with a component that has a failure rate of $\lambda$, compare
two different ways of improving reliability: (a) by placing a second component
in parallel and (b) by improving the reliability of a single component
by high-quality design. What is the reduced equivalent failure rate of (a)?
Comment on what you think the cost would be to achieve the same
reductions if (b) is used.

7.33. Starting with a component that has a failure rate of $\lambda$, compare
two different ways of improving reliability: (a) by placing a second component
in parallel and (b) by placing a second component in standby. What is
the reduced equivalent failure rate of (a)? Of (b)? Comment on what you
think the comparative cost would be to achieve the same reductions if (b) is
used.

7.34. Choose a project with which you are familiar. Decompose the structure 
as was done in Fig. 7.3; then discuss. 



APPENDIX A 


SUMMARY OF PROBABILITY 
THEORY* 


A1 INTRODUCTION 

Several of the analytical techniques discussed in this text are based on
probability theory. Many readers have an adequate background in probability and
need only refer to this appendix for notation and brief review. However, some
readers may not have studied probability, and this appendix should serve as a
brief and concise introduction for them. If additional explanation is required,
an introductory probability text should be consulted [Meeker, 1998;
Mendenhall, 1990; Stone, 1996; Wadsworth and Bryan, 1960].


A2 PROBABILITY THEORY 

“Probability had its beginnings in the 17th century when the Chevalier de Mere,
supposedly an ardent gambler, became puzzled over how to divide the winnings
in a game of chance. He consulted the French mathematician Blaise
Pascal (1623-1662), who in turn wrote about this matter to Pierre Fermat
(1601-1665); it is this correspondence which is generally considered the
origin of modern probability theory” [Freund, 1962]. In the 18th century Karl
Gauss (1777-1855) and Pierre Laplace (1749-1827) further developed
probability theory and applied it to fields other than games of chance.

Today, probability theory is viewed in three different ways: the a priori
(equally-likely-events) approach, the relative-frequency approach, and the
axiomatic definition [Papoulis, 1965]. Intuitively we state that the
probability of obtaining the number 2 on a single roll of a die is 1/6.
Assuming each of the six faces is equally likely and that there is one
favorable outcome, we merely take the ratio. This is a convenient approach;
however, it fails in the case of a loaded die, where all events are not
equally likely, and also in the case of compound events, where the definition
of “equally likely” is not at all obvious. The relative-frequency approach
begins with a discussion of an experiment such as the rolling of a die. The
experiment is repeated $n$ times (or $n$ identical dice are all rolled at the
same time in identical fashion). If $n_2$ represents the number of times that
two dots face up, then the ratio $n_2/n$ is said to approach the probability
of rolling a 2 as $n$ approaches infinity. The requirement that the experiment
be repeated an infinite number of times and that the probability be defined as
the limit of the frequency ratio can cause theoretical problems in some
situations unless stated with care. The newest and most generally accepted
approach is to base probability theory on three fundamental axioms. The entire
theory is built in a deductive manner on these axioms in much the same way
plane geometry is developed in an axiomatic manner. This approach has the
advantage that if it is followed carefully, there are no loopholes, and all
properties are well defined. As with any other theory or abstract model, the
engineering usefulness of the technique is measured by how well it describes
problems in the physical world. In order to evaluate the parameters in the
axiomatic model one may perform an experiment and utilize the
relative-frequency interpretation or invoke a hypothesis on the basis of
equally likely events. In fact a good portion of mathematical statistics is
devoted to sophisticated techniques for determining probability values from
an experiment.

*This appendix is largely extracted from Appendix A of Software Engineering:
Design, Reliability, and Management, by M. L. Shooman, McGraw-Hill, New York,
1983.
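
The relative-frequency idea is easy to try numerically. The following is a
small illustrative simulation, assuming a fair six-sided die modeled with
Python's pseudorandom generator; the function name and the fixed seed are
arbitrary choices made here for repeatability.

```python
import random

def relative_frequency(face, n_rolls, seed=0):
    """Fraction of n_rolls simulated fair-die rolls that show `face`."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(n_rolls) if rng.randint(1, 6) == face)
    return hits / n_rolls

if __name__ == "__main__":
    # The ratio n_2 / n should drift toward 1/6 as n grows.
    for n in (100, 10_000, 1_000_000):
        print(n, relative_frequency(2, n))
```

No finite run reaches the limit exactly, which is the practical face of the
theoretical difficulty noted above.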

The axiomatic approach begins with a statement of the three fundamental 
axioms of probability: 

1. The probability that an event $A$ occurs is a number between zero and
unity:

$0 \le P(A) \le 1$ (A1)

2. The probability of a certain event (also called the entire sample space or
the universal set) is unity:

$P(S) = 1$ (A2)

3. The probability of the union (also called sum) of two disjoint (also called
mutually exclusive) events is the sum of the probabilities:

$P(A_1 + A_2) = P(A_1) + P(A_2)$ (A3)

Figure A1 Venn diagram illustrating the union of sets $A_1$ and $A_2$: (a)
ordinary sets; (b) disjoint sets.


A3 SET THEORY 
A3.1 Definitions 

Since axiomatic probability is based on set theory, we shall discuss briefly a
few concepts of sets. The same concept often appears in set theory and in
probability theory, with different notation and nomenclature being used for the
same ideas. A set is simply a collection or enumeration of objects. The order
in which the objects of the set are enumerated is not significant. Typical sets
are the numbers 1, 2, 3, all 100 cars in a parking lot, and the 52 cards in a
deck. Each item in the collection is an element of the set. Thus, in the
examples given there are 3, 100, and 52 elements, respectively. Each set
(except the trivial one composed of only one element) contains a number of
subsets. The subsets are defined by a smaller number of elements selected from
the original set. To be more specific one first defines the largest set of any
interest in the problem and calls this the universal set $U$. The universal set
contains all possible elements in the problem. Thus, a universal set of $n$
elements has a maximum of $2^n$ distinct subsets. The universal set might be
all cars in the United States, all red convertibles in New York, or all cars in
the parking lot. This is a chosen collection which is fixed throughout a
problem. In probability theory, the type of sets one is interested in consists
of those which can, at least in theory, be viewed as outcomes of an experiment.
These sets are generally called events. When the concept of universal set is
used in probability theory, the term sample space $S$ is generally applied. It
is often convenient to associate a geometric picture, called a Venn diagram,
with these ideas of sample space and event (or set and subset), and the sample
space is represented by a rectangle (see Fig. A1).
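
The count of $2^n$ subsets can be confirmed by listing them. The sketch below
is illustrative only; the helper name `all_subsets` is an assumption, and the
count includes the empty set and the universal set itself.

```python
from itertools import combinations

def all_subsets(universal_set):
    """List every subset of a finite set, from the empty set to the full set."""
    items = list(universal_set)
    return [set(c) for k in range(len(items) + 1)
            for c in combinations(items, k)]

if __name__ == "__main__":
    subsets = all_subsets({1, 2, 3})
    print(len(subsets))  # 2**3 subsets for a 3-element universal set
    for s in subsets:
        print(s)
```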

A3.2 Axiomatic Probability 

With the above background one can discuss intelligently the meaning of
probability axioms 1 and 2 given in Eqs. (A1) and (A2). Equation (A1) implies
that the probability of an event $A$ is a number between zero and one.
From the relative-frequency interpretation we know that the probability of a
certain event is unity and the probability of an impossible event is zero. All
other events have probabilities between zero and unity. In Eq. (A2) we let the
event $A$ be the entire sample space $S$, and not too surprisingly we find that
this is a certain event. This is true because we say that $S$ occurs if at
least one element of $S$ occurs.

A3.3 Union and Intersection 

The union of sets $A_1$ and $A_2$ is a third set $B$. Set $B$ contains all the
elements which are in set $A_1$ or in set $A_2$ or in both sets $A_1$ and
$A_2$. Symbolically,

$B = A_1 \cup A_2$ or $B = A_1 + A_2$ (A4)

The $\cup$ notation is more common in mathematical work, whereas the +
notation is commonly used in applied work. The union operation is most easily
explained in terms of the Venn diagram of Fig. A1(a). Set $A_1$ is composed of
disjoint subsets $C_1$ and $C_2$ and set $A_2$ of disjoint subsets $C_2$ and
$C_3$. Subset $C_2$ represents points common to $A_1$ and $A_2$, whereas $C_1$
represents points in $A_1$ but not in $A_2$, and $C_3$ represents points that
are in $A_2$ but not $A_1$. When the two sets have no common elements, the
areas do not overlap [Fig. A1(b)], and they are said to be disjoint or mutually
exclusive.

The intersection of events $A_1$ and $A_2$ is defined as a third set $D$ which
is composed of all elements which are contained in both $A_1$ and $A_2$. The
notation is:

$D = A_1 \cap A_2$ or $D = A_1 A_2$ or $D = A_1 \cdot A_2$ (A5)

As before, the former is more common in mathematical literature and the latter
more common in applied work. In Fig. A1(a), $A_1 A_2 = C_2$, and in Fig. A1(b),
$A_1 A_2 = \emptyset$. If two sets are disjoint, they contain no common
elements, and their intersection is a set with no elements called a null set,
$\emptyset$. $P(\emptyset) = 0$.
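
Union and intersection map directly onto set operations in most programming
languages. The following Python sketch mirrors the decomposition of Fig. A1(a);
the element labels are hypothetical placeholders chosen only for illustration.

```python
# Disjoint pieces of the Venn diagram (hypothetical element labels).
C1 = {"a", "b"}   # in A1 only
C2 = {"c"}        # common to A1 and A2
C3 = {"d", "e"}   # in A2 only

A1 = C1 | C2
A2 = C2 | C3

B = A1 | A2       # union, written A1 + A2 in the applied notation
D = A1 & A2       # intersection, written A1 A2

if __name__ == "__main__":
    print(sorted(B))           # every element of C1, C2, and C3
    print(sorted(D))           # just the common part C2
    print(C1 & C3 == set())    # disjoint sets intersect in the null set
```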

A3.4 Probability of a Disjoint Union 

We can now interpret the third probability axiom given in Eq. (A3) in terms of
a card-deck example. The events in the sample space are disjoint and (using
the notation $S_3$ = three of spades, etc.),

$P(\text{spades}) = P(S_1 + S_2 + \cdots + S_Q + S_K)$

Since all events are disjoint,

$P(\text{spades}) = P(S_1) + P(S_2) + \cdots + P(S_Q) + P(S_K)$ (A6)

From the equally-likely-events hypothesis one would expect that for a fair deck
(without nicks, spots, bumps, torn corners, or other marking) the probability
of drawing a spade is given by:

$P(\text{spades}) = \frac{1}{52} + \frac{1}{52} + \cdots + \frac{1}{52} + \frac{1}{52} = \frac{13}{52} = \frac{1}{4}$


A4 COMBINATORIAL PROPERTIES 
A4.1 Complement 

The complement of set $A$, written as $\bar{A}$, is another set $B$. The
notation $A'$ is sometimes used for complement, and both notations will be used
interchangeably in this book. Set $B = \bar{A}$ is composed of all the elements
of the universal set which are not in set $A$. (The term A not is often used in
engineering circles instead of A complement.) By definition the union of $A$
and $\bar{A}$ is the universal set.

$A + \bar{A} = U$ (A7)

Applying axioms 2 and 3 from Eqs. (A2) and (A3) to Eq. (A7) yields

$P(A + \bar{A}) = P(A) + P(\bar{A}) = P(S) = 1$

This is valid since $A$ and $\bar{A}$ are obviously disjoint events (we have
substituted the notation $S$ for $U$, since the former is more common in
probability work). Because probabilities are merely numbers, the above
algebraic equation can be written in three ways:

$P(A) + P(\bar{A}) = 1$

$P(A) = 1 - P(\bar{A})$

$P(\bar{A}) = 1 - P(A)$ (A8)

There is considerable similarity between the logic operations presented above
and the digital logic of Section C1.

A4.2 Probability of a Union 

Perhaps the first basic relationship to be deduced is the probability of a
union of two events which are not mutually exclusive. We begin by extending
the axiom of Eq. (A3) to three or more events. Assuming that event $A_2$ is the
union of two other disjoint events $B_1 + B_2$, we obtain

$A_2 = B_1 + B_2$

$P(A_1 + A_2) = P(A_1) + P(B_1 + B_2) = P(A_1) + P(B_1) + P(B_2)$

By successive application of this stratagem of splitting events into unions of
other mutually exclusive events, we obtain the general result by induction

$P(A_1 + A_2 + \cdots + A_n) = P(A_1) + P(A_2) + \cdots + P(A_n)$ for disjoint A's (A9)

If we consider the case of two events $A_1$ and $A_2$ which are not disjoint,
we can divide each event into the union of two subevents. This is most easily
discussed with reference to the Venn diagram shown in Fig. A1(a). The event
(set) $A_1$ is divided into those elements (1) which are contained in $A_1$ and
not in $A_2$, $C_1$, and (2) which are common to $A_1$ and $A_2$, $C_2$. Then
$A_1 = C_1 + C_2$. Similarly we define $A_2 = C_3 + C_2$. We have now broken
$A_1$ and $A_2$ into disjoint events and can apply Eq. (A9):

$P(A_1 + A_2) = P(C_1 + C_2 + C_2 + C_3) = P[C_1 + C_3 + (C_2 + C_2)]$

By definition, the union of $C_2$ with itself is $C_2$; therefore

$P(A_1 + A_2) = P(C_1 + C_2 + C_3) = P(C_1) + P(C_2) + P(C_3)$

We can manipulate this result into a more useful form if we add and subtract
the number $P(C_2)$ and apply Eq. (A3) in reverse

$P(A_1 + A_2) = [P(C_1) + P(C_2)] + [P(C_2) + P(C_3)] - P(C_2)$

$= P(A_1) + P(A_2) - P(A_1 A_2)$ (A10)

Thus, when events $A_1$ and $A_2$ are not disjoint, we must subtract the
probability of the intersection of $A_1$ and $A_2$ from the sum of the
probabilities. Note that Eq. (A10) reduces to Eq. (A3) if events $A_1$ and
$A_2$ are disjoint since $P(A_1 A_2) = 0$ for disjoint events.
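
Eq. (A10) can be checked by direct counting on a deck of cards. The sketch
below assumes a fair 52-card deck in which every card is equally likely; the
names `prob` and `prob_union` are illustrative. The helper alternates the sign
of the $k$-fold intersections, so the two-event rule is simply its smallest
nontrivial case.

```python
from fractions import Fraction
from itertools import combinations, product

# A fair deck: (rank, suit) pairs, each assumed equally likely.
DECK = frozenset(product("A23456789TJQK", "SHDC"))

def prob(event):
    """Probability of an event (a set of cards) under equal likelihood."""
    return Fraction(len(event), len(DECK))

def prob_union(events):
    """Union probability as an alternating sum over k-fold intersections."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = 1 if k % 2 else -1
        for group in combinations(events, k):
            total += sign * prob(frozenset.intersection(*group))
    return total

spades = frozenset(c for c in DECK if c[1] == "S")
hearts = frozenset(c for c in DECK if c[1] == "H")
kings = frozenset(c for c in DECK if c[0] == "K")

if __name__ == "__main__":
    # P(spade or king) = 13/52 + 4/52 - 1/52
    print(prob_union([spades, kings]), prob(spades | kings))
```

Exact fractions are used so that the two sides can be compared for equality
rather than to within rounding error.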

Equation (A10) can be extended to apply to three or more events:

$P(A_1 + A_2 + \cdots + A_n)$

$= [P(A_1) + P(A_2) + \cdots + P(A_n)]$   ($\binom{n}{1} = n$ terms)

$- [P(A_1 A_2) + P(A_1 A_3) + \cdots + P(A_i A_j)]$   ($\binom{n}{2}$ terms, $i \ne j$)

$+ [P(A_1 A_2 A_3) + P(A_1 A_2 A_4) + \cdots + P(A_i A_j A_k)]$   ($\binom{n}{3}$ terms, $i \ne j \ne k$)

$- \cdots + (-1)^{n-1} [P(A_1 A_2 \cdots A_n)]$   ($\binom{n}{n} = 1$ term)   (A11)

The complete expansion of Eq. (A11) involves $(2^n - 1)$ terms.

A4.3 Conditional Probabilities and Independence 

It is important to study in more detail the probability of an intersection of two 
events, that is, P(AiA 2 ). We are especially interested in how P(A | A 2 ) is related 
to Pi A i) and P(A 2 ). 

Before proceeding further we must define conditional probability and intro¬ 
duce a new notation. Suppose we want the probability of obtaining the four of 
clubs on one draw from a deck of cards. The answer is of course 1/52, which 
can be written: P(Czt) = 1/52. Let us change the problem so it reads: What is 
the probability of drawing the four of clubs given that a club is drawn ? The 
answer is 1/13. 

In such a situation we call the probability statement a conditional probabil¬ 
ity. The notation P{C^\C) = 1/13 is used to represent the conditional proba¬ 
bility of drawing a four of clubs given that a club is drawn. We read P(A 2 |Ai) 
as the probability of At occurring conditioned on the previous occurrence of 
Ai, or more simply as the probability of At given A\. 

P(A 1 At) = P(A 1 )P(At|A 1 ) (A 12a) 

P(A 1 At) = P(A 2 )P(A 1 |At) (A12b) 

Intuition tells us that there must be many cases in which

$$P(A_2 \mid A_1) = P(A_2)$$

In other words, the probability of occurrence of event $A_2$ is independent of the
occurrence of event $A_1$. From Eq. (A12a) we see that this implies
$P(A_1 A_2) = P(A_1)\,P(A_2)$, and this latter result in turn implies

$$P(A_1 \mid A_2) = P(A_1)$$

Thus we define independence by any one of the three equivalent relations

$$P(A_1 A_2) = P(A_1)\,P(A_2) \tag{A13a}$$

$$P(A_1 \mid A_2) = P(A_1) \tag{A13b}$$

$$P(A_2 \mid A_1) = P(A_2) \tag{A13c}$$


Conditional probabilities are sometimes called dependent probabilities. 
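The card example and the independence relations can be verified by enumerating a deck. The following is a minimal sketch; the helper names (`prob`, `is_club`, `is_four`) are illustrative and not part of the text, and all 52 cards are assumed equally likely.

```python
from fractions import Fraction
from itertools import product

RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["clubs", "diamonds", "hearts", "spades"]
DECK = list(product(RANKS, SUITS))  # 52 equally likely (rank, suit) pairs


def prob(event, given=None):
    """P(event), or P(event | given) when a conditioning event is supplied."""
    space = [card for card in DECK if given is None or given(card)]
    hits = [card for card in space if event(card)]
    return Fraction(len(hits), len(space))


def is_club(card):
    return card[1] == "clubs"


def is_four(card):
    return card[0] == "4"


def is_four_of_clubs(card):
    return is_four(card) and is_club(card)


p_c4 = prob(is_four_of_clubs)                         # P(C_4)
p_c4_given_c = prob(is_four_of_clubs, given=is_club)  # P(C_4 | C)

# Product rule with A_1 = "club" and A_2 = "four":
# P(A_1 A_2) = P(A_1) P(A_2 | A_1)
product_rule = prob(is_club) * prob(is_four, given=is_club)

# Independence: knowing the suit does not change the chance of a four.
rank_independent_of_suit = prob(is_four, given=is_club) == prob(is_four)
print(p_c4, p_c4_given_c, product_rule, rank_independent_of_suit)
```

Here the rank and the suit of a card are independent events, so the joint probability of "four" and "club" factors into the product of the two individual probabilities.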

One can define conditional probabilities for three events by splitting event
$B$ into the intersection of events $A_2$ and $A_3$. Then letting $A = A_1$ and
$B = A_2 A_3$, we have

$$
\begin{aligned}
P(AB) &= P(A)\,P(B \mid A) = P(A_1)\,P(A_2 A_3 \mid A_1) \\
      &= P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 A_2)
\end{aligned}
$$

Successive application of this technique leads to the general result

$$P(A_1 A_2 \cdots A_n) = P(A_1)\,P(A_2 \mid A_1)\,P(A_3 \mid A_1 A_2) \cdots P(A_n \mid A_1 A_2 \cdots A_{n-1}) \tag{A14}$$

Thus, the probability of the intersection of $n$ events is expressed as the
product of one unconditional (independent) probability and $n - 1$ conditional
(dependent) probabilities.
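As a worked instance of the chain of conditional probabilities, consider drawing three aces in three draws without replacement. This example is an illustration chosen here, not one from the text; the sketch compares the product of conditional probabilities with a brute-force count over all ordered draws.

```python
from fractions import Fraction
from itertools import permutations

# Chain of conditional probabilities:
# P(A_1 A_2 A_3) = P(A_1) P(A_2 | A_1) P(A_3 | A_1 A_2)
# A_i = "the i-th card drawn is an ace", drawing without replacement.
chain = Fraction(4, 52) * Fraction(3, 51) * Fraction(2, 50)

# Brute force: label the cards 0..51 and let cards 0..3 be the aces.
total = 0
favorable = 0
for draw in permutations(range(52), 3):
    total += 1
    if all(card < 4 for card in draw):
        favorable += 1

brute = Fraction(favorable, total)
print(chain, brute)
```

Each factor is conditioned on all previous draws, which is why the numerators and denominators both shrink by one at every step.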


A5 DISCRETE RANDOM VARIABLES 
A5.1 Density Function 

We can define $x$ as a random variable if we associate each value of $x$ with
an element in event $A$ defined on sample space $S$. If the random variable $x$
assumes a finite number of values, then $x$ is called a discrete random variable.
In the case of a discrete random variable, we associate with each value of
$x$ a number $x_i$ and a probability of occurrence $P(x_i)$. We could describe the
probabilities associated with the random variable by a table of values, but it is
easier to write a formula that permits calculation of $P(x_i)$ by substitution of the
appropriate value of $x_i$. Such a formula is called a probability function for the
random variable $x$. More exactly, we use the notation $f(x)$ to mean a discrete
probability density function associated with the discrete random variable $x$.
(The reason for the inclusion of the word "density" will be clear once the
parallel development for continuous random variables is completed.) Thus,

$$P(x = x_i) = P(x_i) = f(x_i) \tag{A15}$$

In general we use the sequence of nonnegative integers 0, 1, 2, ..., n to represent
the subscripts of the n + 1 discrete values of x. Thus, the random variable is
denoted by $x$ and particular values of the random variable by
$x_0, x_1, x_2, \ldots, x_n$. If the random variable under consideration is a
nonnumerical quantity, e.g., the colors of the spectrum (red, orange, yellow,
green, blue, indigo, violet), then the colors (or other quantity) would first be
coded by associating a number 1 to 7 with each. If the random variable $x$ is
defined over the entire sample space $S$, $P(A)$ is given by

$$P(A) = \sum_{x_i \in A} P(x_i) = \sum_{x_i \in A} f(x_i) \tag{A16}$$

where each sum runs over all values $x_i$ which are elements of $A$.
Figure A2 (a) Line diagram depicting the discrete density function for the throw of
one die; (b) step diagram depicting the discrete distribution function for the density
function given in (a).

The probability of the sample space is

$$P(S) = \sum_{\text{all } i} f(x_i) = 1 \tag{A17}$$


As an example of the above concepts we shall consider the throw of one
die. The random variable $x$ is the number of spots which face up on any throw.
The domain of the random variable is $x = 1, 2, 3, 4, 5, 6$. Using the
equally-likely-events hypothesis, we conclude that

$$P(x = 1) = P(x = 2) = \cdots = P(x = 6) = \tfrac{1}{6}$$

Thus, $f(x) = 1/6$, a constant density function. This can also be depicted
graphically as in Fig. A2(a). The probability of an even roll is

$$P(\text{even}) = \sum_{i = 2, 4, 6} f(x_i) = \tfrac{1}{6} + \tfrac{1}{6} + \tfrac{1}{6} = \tfrac{1}{2}$$

A5.2 Distribution Function 

It is often convenient to deal with another, related function rather than the
density function itself. The distribution function is defined in terms of the
probability that the random variable $\mathbf{x}$ does not exceed a given value $x$:

$$P(\mathbf{x} \le x) = F(x) = \sum_{x_i \le x} f(x_i) \tag{A18}$$

The distribution function is a cumulative probability and is often called the
cumulative distribution function. The analytical form of $F(x)$ for the example
in Fig. A2 is

$$F(x) = \frac{x}{6} \quad \text{for } 1 \le x \le 6 \tag{A19}$$

Equation (A18) relates $F(x)$ to $f(x)$ by a process of summation. One can
write an inverse relation¹ defining $f(x)$ in terms of the difference between two
values of $F(x)$:

$$f(x) = F(x^+) - F(x^-) \tag{A20}$$

In other words, $f(x)$ is equal to the value of the discontinuity at $x$ in the step
diagram of $F(x)$. There are a few basic properties of density and distribution
functions which are of importance: (1) since $f(x)$ is a probability, $0 \le f(x) \le 1$;
(2) because $P(S) = 1$,

$$\sum_{\text{all } x} f(x) = 1$$
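The die example makes the link between the density and distribution functions concrete. The sketch below builds f(x) as a table, accumulates it into F(x), and recovers f(x) again from the jumps in F(x); the dictionary representation is an implementation choice made here for illustration.

```python
from fractions import Fraction
from itertools import accumulate

values = [1, 2, 3, 4, 5, 6]

# Constant density for a fair die: f(x) = 1/6.
f = {x: Fraction(1, 6) for x in values}

# Probability of an event: sum the density over the outcomes in the event.
p_even = sum(f[x] for x in values if x % 2 == 0)

# Distribution function: running sum of the density, F(x) = sum of f(x_i), x_i <= x.
F = dict(zip(values, accumulate(f[x] for x in values)))

# Inverse relation: f(x) is the size of the jump in F at x.
jumps = {x: F[x] - F.get(x - 1, Fraction(0)) for x in values}

print(p_even, F[6], jumps == f)
```

The two basic properties are visible directly: every f(x) lies between 0 and 1, and the final value of F equals the sum of the density over all outcomes.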


A5.3 Binomial Distribution 

Many discrete probability models are used in applications, the foremost being
the binomial distribution and the Poisson distribution. The binomial distribution
(sometimes called the Bernoulli distribution) applies to a situation in
which an event can either occur or not occur (the more common terms are
success or failure, a legacy from the days when probability theory centered
around games of chance). The terms success and failure, of course, are ideally
suited to reliability applications. The probability of success on any one trial is
$p$, and that of failure is $1 - p$. The number of independent trials is denoted by
$n$, and the number of successes by $r$. Thus, the probability of $r$ successes in $n$
trials with the probability of one success being $p$ is

$$B(r; n, p) = \binom{n}{r} p^r (1 - p)^{n - r} \quad \text{for } r = 0, 1, 2, \ldots, n \tag{A21}$$

where

$$\binom{n}{r} = \frac{n!}{r!\,(n - r)!} = \text{number of combinations of } n \text{ things taken } r \text{ at a time}$$

A number of line diagrams for the binomial density function² are given in
Fig. A3. In Fig. A3 the number of trials is fixed at nine, and the probability
of success on each trial is changed from 0.2 to 0.5 to 0.8. Intuition tells us
that the most probable number of successes is $np$, which is 1.8, 4.5, and 7.2,
respectively. (It is shown in Section A7 that intuition has predicted the mean
value.)

Figure A3 Binomial density function $B(r; 9, p)$ for fixed $n$; panels: $n = 9$, $p = 0.2$;
$n = 9$, $p = 0.5$; $n = 9$, $p = 0.8$. (Adapted from Wadsworth and Bryan [1960].)

¹ The notations $F(x^+)$ and $F(x^-)$ mean the limits approached from the right and left, respectively.
² We use the notation $B(r; n, p)$ rather than the conventional and less descriptive notation $f(x)$.

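Example 1: Clearly, we could use the binomial distribution to predict the
probability of twice obtaining a 3 in six throws of a die:

$$r = 2 \qquad n = 6 \qquad p = \tfrac{1}{6}$$

$$B\left(2; 6, \tfrac{1}{6}\right) = \binom{6}{2} \left(\tfrac{1}{6}\right)^2 \left(1 - \tfrac{1}{6}\right)^{6 - 2} = 15 \times 0.0134 = 0.201$$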
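The binomial probability in Example 1 is easy to evaluate numerically. The sketch below is a direct transcription of Eq. (A21) using Python's standard `math.comb`; the function name `binomial` is a label chosen here.

```python
from math import comb


def binomial(r, n, p):
    """B(r; n, p): probability of r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


# Example 1: exactly two 3s in six throws of a die.
p_two_threes = binomial(2, 6, 1 / 6)

# Sanity check: the density sums to one over r = 0, 1, ..., n (here n = 9, p = 0.2).
total = sum(binomial(r, 9, 0.2) for r in range(10))

print(round(p_two_threes, 4), round(total, 12))
```

The exact value is 15 x 625/46656, about 0.2009, and summing the density over all possible r confirms that it accounts for the whole sample space.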

Example 2: We can also use the binomial distribution to evaluate the probability
of picking three aces on ten draws with replacement from a deck. If we do
not replace the drawn cards after each pick, however, the binomial model no
longer holds: the parameter $p$ changes with each draw, so the trials are no
longer independent. The proper distribution to use in such a case is the
hypergeometric distribution [Freeman, 1963, pp. 113-120; Wadsworth and Bryan,
1960, p. 59]:

$$H(k; j, n, N) = \frac{\binom{n}{k} \binom{N - n}{j - k}}{\binom{N}{j}} \tag{A21a}$$

where

k = the number of successes
j = the number of trials
n = the finite number of possible successes
N = the finite number of possible trials
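To see how much replacement matters in Example 2, the two models can be compared numerically. The sketch below assumes the standard form of the hypergeometric density, with the parameters k, j, n, N as defined above, and a 52-card deck containing 4 aces; the function names are chosen here for illustration.

```python
from math import comb


def binomial(r, n, p):
    """B(r; n, p): r successes in n independent trials (draws with replacement)."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def hypergeometric(k, j, n, N):
    """H(k; j, n, N): k successes in j draws without replacement from N items,
    n of which count as successes (standard form, assumed here)."""
    return comb(n, k) * comb(N - n, j - k) / comb(N, j)


# Three aces in ten draws from a 52-card deck with 4 aces.
with_replacement = binomial(3, 10, 4 / 52)
without_replacement = hypergeometric(3, 10, 4, 52)

# The hypergeometric density sums to one over the possible numbers of aces.
total = sum(hypergeometric(k, 10, 4, 52) for k in range(5))

print(with_replacement, without_replacement, total)
```

Without replacement, three aces in ten draws is noticeably less likely than the binomial model predicts, because each ace drawn reduces the chance of drawing another.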
A5.4 Poisson Distribution

Another discrete distribution of great importance is the Poisson distribution,
which can be derived in a number of ways. One derivation will be outlined
in this section (see Shooman [1990], pp. 37-42), and a second derivation in
Section A8. If $p$ is very small and $n$ is very large, the binomial density, Eq.
(A21), takes on a special limiting form, which is the Poisson law of probability.
Starting with Eq. (A21), we let $np$, the most probable number of occurrences,
be some number $\mu$:

$$\mu = np \qquad p = \frac{\mu}{n}$$

$$B\left(r; n, \frac{\mu}{n}\right) = \frac{n!}{r!\,(n - r)!} \left(\frac{\mu}{n}\right)^r \left(1 - \frac{\mu}{n}\right)^{n - r}$$

The limiting form called the Poisson distribution is

$$f(r; \mu) = \frac{\mu^r e^{-\mu}}{r!} \tag{A22}$$

The Poisson distribution can be written in a second form, which is very
useful for our purposes. If we are interested in events which occur in time, we
can define the rate of occurrence as the constant $\lambda$ = occurrences per unit time;
thus $\mu = \lambda t$. Substitution yields the alternative form of the Poisson distribution:

$$f(r; \lambda, t) = \frac{(\lambda t)^r e^{-\lambda t}}{r!} \tag{A23}$$

Line diagrams for the Poisson density function given in Eq. (A22) are shown
in Fig. A4 for various values of $\mu$. Note that the peak of the distribution is near
$\mu$ and that symmetry about the peak begins to develop for larger values of $\mu$.
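The limiting behavior can be observed numerically by holding the product np fixed while n grows. In this sketch the mean value 2.0 and the values of n are arbitrary illustrative choices.

```python
from math import comb, exp, factorial


def binomial(r, n, p):
    """Binomial density: r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def poisson(r, mu):
    """Poisson density with mean mu: mu^r e^(-mu) / r!."""
    return mu**r * exp(-mu) / factorial(r)


mu = 2.0  # hypothetical mean number of occurrences, mu = n * p
errors = []
for n in (10, 100, 10_000):
    p = mu / n
    worst = max(abs(binomial(r, n, p) - poisson(r, mu)) for r in range(8))
    errors.append(worst)
    print(f"n = {n:6d}  largest difference = {worst:.2e}")
```

The largest difference between the two densities shrinks as n increases and p decreases, which is the sense in which the Poisson law is the limiting form of the binomial.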


A6 CONTINUOUS RANDOM VARIABLES 
A6.1 Density and Distribution Functions 

The preceding section introduced the concept of a discrete random variable and
its associated density and distribution functions. A similar development will be
pursued in this section for continuous variables. Examples of some continuous
random variables are the length of a manufactured part, the failure time of a
system, and the value of a circuit resistance. In each of these examples there is
no reason to believe that the random variable takes on discrete values. On the
contrary, the variable is continuous over some range of definition. In a manner
analogous to the development of the discrete variable, we define a continuous
density function and a continuous distribution function. We shall start with the
cumulative distribution function.

Figure A4 Poisson density function for several values of $\mu$.

The cumulative distribution function for the discrete case was defined in Eq.
(A18) as a summation. If the spacings between the discrete values of the random
variable $x$ are $\Delta x$ and we let $\Delta x \to 0$, then the discrete variable becomes
a continuous variable, and the summation becomes an integration. Thus, the
cumulative distribution function of a continuous random variable is given by

$$F(x) = \int_{\text{over the domain of } x} f(x)\,dx \tag{A24}$$

If we let $x$ take on all values between points $a$ and $b$,

$$P(\mathbf{x} \le x) = F(x) = \int_a^x f(x)\,dx \quad \text{for } a < x \le b \tag{A25}$$

The density function $f(x)$ is given by the derivative of the distribution function.
This is easily seen from Eq. (A25) and the fact that the derivative of the integral
of a function is the function itself:

$$\frac{dF(x)}{dx} = f(x) \tag{A26}$$
The probability that $\mathbf{x}$ lies in an interval $x < \mathbf{x} \le x + dx$ is given by

$$
\begin{aligned}
P(x < \mathbf{x} \le x + dx) &= P(\mathbf{x} \le x + dx) - P(\mathbf{x} \le x) \\
&= \int_a^{x + dx} f(x)\,dx - \int_a^x f(x)\,dx = \int_x^{x + dx} f(x)\,dx \\
&= F(x + dx) - F(x)
\end{aligned}
\tag{A27}
$$

It is easy to see from Eq. (A27) that if $F(x)$ is continuous and we let $dx \to 0$,
$P(\mathbf{x} = x)$ is zero. Thus, when we deal with continuous probability, it makes
sense to talk of the probability that $\mathbf{x}$ is within an interval rather than at one
point. In fact, since $P(\mathbf{x} = x)$ is zero, the numerical value is the same in the
continuous case whether the interval is open or closed, since

$$P(a < \mathbf{x} < b) = P(a \le \mathbf{x} < b) = P(a < \mathbf{x} \le b) = P(a \le \mathbf{x} \le b)$$

Thus, the density function $f(x)$ is truly a density, and like any other density
function it yields a value only when integrated over some finite interval. The basic
properties of density and distribution functions previously discussed in the
discrete case hold in the continuous case. At the lower limit of $x$ we have
$F(a) = 0$, and at the upper limit $F(b) = 1$. These two statements, coupled with Eq.
(A27), lead to $\int_a^b f(x)\,dx = 1$. Since $f(x)\,dx$ is a probability, $f(x)$ is nonnegative,
and $F(x)$, its integral, is a nondecreasing function.


A6.2 Rectangular Distribution 

The simplest continuous variable distribution is the uniform or rectangular
distribution shown in Fig. A5(a). The two parameters of this distribution are the
limits $a$ and $b$. This model predicts a uniform probability of occurrence in any
interval of width $\Delta x$ between $a$ and $b$:

$$P(x < \mathbf{x} \le x + \Delta x) = \frac{\Delta x}{b - a}$$

A6.3 Exponential Distribution 

Another simple continuous variable distribution is the exponential distribution.
The exponential density function is

$$f(x) = \lambda e^{-\lambda x} \quad 0 < x \le +\infty \tag{A28}$$

which is sketched in Fig. A5(b).

Figure A5 Various continuous variable probability distributions: (a) uniform
distribution; (b) exponential distribution; (c) Rayleigh distribution; (d) Weibull distribution;
and (e) normal distribution.

This distribution recurs time and time again in reliability work. The exponential
is the distribution of the time to failure $t$ for a great number of electronic-system
parts. The parameter $\lambda$ is constant and is called the conditional failure rate,
with the units fractional failures per hour. The distribution function yields the
failure probability and $1 - F(t)$ the success probability. Specifically, the
probability of no failure (success) in the interval 0 to $t_1$ is given by

$$P_s(t_1) = 1 - F(t_1) = e^{-\lambda t_1}$$
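The relation between the exponential density, its distribution function, and the success probability can be checked by numerical integration. In the sketch below the failure rate of 0.001 per hour and the 500-hour interval are hypothetical values, and the trapezoidal rule is simply one convenient way to approximate the integral.

```python
from math import exp

lam = 1e-3  # hypothetical constant failure rate, failures per hour


def density(t):
    """Exponential density: lam * exp(-lam * t)."""
    return lam * exp(-lam * t)


def reliability(t):
    """Success probability: exp(-lam * t)."""
    return exp(-lam * t)


def integrate(func, a, b, steps=10_000):
    """Trapezoidal approximation of the integral of func from a to b."""
    h = (b - a) / steps
    interior = sum(func(a + i * h) for i in range(1, steps))
    return h * (0.5 * func(a) + interior + 0.5 * func(b))


t1 = 500.0                                  # hours
failure_prob = integrate(density, 0.0, t1)  # F(t1), by integrating the density
print(1 - failure_prob, reliability(t1))
```

Both routes give a success probability of about 0.607 for this choice of parameters, since the product of rate and time is 0.5.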


A6.4 Rayleigh Distribution 

Another single-parameter density function of considerable importance is the
Rayleigh distribution, which is given as

$$f(x) = K x e^{-K x^2 / 2} \quad 0 < x \le +\infty \tag{A29}$$

and for the distribution function,

$$F(x) = 1 - e^{-K x^2 / 2} \tag{A30}$$

The density function is sketched in Fig. A5(c). The Rayleigh distribution finds
application in noise problems in communication systems and in reliability
work. Whereas the exponential distribution holds for time to failure of a
component with a constant conditional failure rate $\lambda$, the Rayleigh distribution
holds for a component with a linearly increasing conditional failure rate $Kt$.
The probability of success of such a unit is

$$P_s(t) = 1 - F(t) = e^{-K t^2 / 2}$$


A6.5 Weibull Distribution 

Both the exponential and the Rayleigh distributions are single-parameter
distributions which can be represented as special cases of a more general
two-parameter distribution called the Weibull distribution. The density and
distribution functions for the Weibull are

$$f(x) = K x^m e^{-K x^{m+1} / (m+1)} \qquad F(x) = 1 - e^{-K x^{m+1} / (m+1)} \tag{A31}$$

This family of functions is sketched for several values of $m$ in Fig. A5(d).
When $m = 0$, the distribution becomes exponential, and when $m = 1$, a Rayleigh
distribution is obtained. The parameter $m$ determines the shape of the
distribution, and parameter $K$ is a scale-change parameter.
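The two special cases of the Weibull family can be confirmed numerically. The sketch below also evaluates the conditional failure rate as the ratio f(x) / [1 - F(x)], which is the usual definition and is assumed here since this section does not state it; the value K = 0.5 is arbitrary.

```python
from math import exp, isclose


def weibull_density(x, K, m):
    """Weibull density: K x^m exp(-K x^(m+1) / (m+1))."""
    return K * x**m * exp(-K * x ** (m + 1) / (m + 1))


def weibull_distribution(x, K, m):
    """Weibull distribution function: 1 - exp(-K x^(m+1) / (m+1))."""
    return 1 - exp(-K * x ** (m + 1) / (m + 1))


def exponential_density(x, lam):
    return lam * exp(-lam * x)


def rayleigh_density(x, K):
    return K * x * exp(-K * x * x / 2)


def failure_rate(x, K, m):
    """Conditional failure rate, assumed to be f(x) / [1 - F(x)]."""
    return weibull_density(x, K, m) / (1 - weibull_distribution(x, K, m))


K = 0.5  # arbitrary scale parameter for the check
for x in (0.5, 1.0, 2.0):
    assert isclose(weibull_density(x, K, 0), exponential_density(x, K))
    assert isclose(weibull_density(x, K, 1), rayleigh_density(x, K))
    print(x, failure_rate(x, K, 0), failure_rate(x, K, 1))
```

With m = 0 the failure rate stays at K for every x, and with m = 1 it grows in proportion to x, matching the constant and linearly increasing rates described for the exponential and Rayleigh cases.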
A6.6 Normal Distribution

The best-known two-parameter distribution is the normal, or Gaussian,
distribution. This distribution is very often a good fit for the size of manufactured
parts, the size of a living organism, or the magnitude of certain electric signals.
It can be shown that when a random variable is the sum of many other random
variables, the variable will have a normal distribution in most cases.

The density function for the normal distribution is written as

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-x^2 / 2\sigma^2} \quad -\infty < x < +\infty$$

This function has a peak of $1/(\sigma\sqrt{2\pi})$ at $x = 0$ and falls off symmetrically on
either side of zero. The rate of falloff and the height of the peak at $x = 0$
are determined by the parameter $\sigma$, which is called the standard deviation. In
general one deals with a random variable $x$ which is spread about some value
such that the peak of the distribution is not at zero. In this case one shifts the
horizontal scale of the normal distribution so that the peak occurs at $x = \mu$:

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-(x - \mu)^2 / 2\sigma^2} \tag{A32}$$

The effect of changing $\sigma$ is as follows: a large value of $\sigma$ means a low, broad
curve and a small value of $\sigma$ a thin, high curve. A change in $\mu$ merely slides
the curve along the $x$ axis.

The distribution function is given by

$$F(x) = \frac{1}{\sigma \sqrt{2\pi}} \int_{-\infty}^{x} e^{-(\xi - \mu)^2 / 2\sigma^2}\,d\xi \tag{A32a}$$

where $\xi$ is a dummy variable of integration. The shapes of the normal density
and distribution functions are shown in Fig. A5(e). The distribution function
given in Eq. (A32a) is left in integral form since the result cannot be expressed in
closed form. This causes no particular difficulty, since $f(x)$ and $F(x)$ have been
extensively tabulated and approximate expansion formulas are readily available
[Abramowitz and Stegun, 1972, pp. 931-936, Section 26.2]. In tabulating the
integral of Eq. (A32a) it is generally convenient to introduce the change of
variables $t = (x - \mu)/\sigma$, which shifts the distribution back to the origin and
normalizes the $x$ axis in terms of $\sigma$ units.

The area under the $f(x)$ curve between $a$ and $b$ is of interest since it
represents the probability that $x$ is within the interval $a < x \le b$. The areas for
$-1 < t < +1$, $-2 < t < +2$, and $-3 < t < +3$ are shown in Fig. A6 along with a
short table of areas between $-\infty$ and $t$.

The normal distribution can also be used as a limiting form for many other 
distributions. The binomial distribution approaches the normal distribution for 
large n. 
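Although the normal distribution function has no closed form, it can be evaluated through the error function, which Python exposes as `math.erf`. The sketch below uses the identity F = [1 + erf(t / sqrt 2)] / 2 for the normalized variable t; the helper name `normal_cdf` and the numerical values in the last lines are illustrative.

```python
from math import erf, sqrt


def normal_cdf(x, mu=0.0, sigma=1.0):
    """Normal distribution function via the error function,
    using the normalized variable t = (x - mu) / sigma."""
    t = (x - mu) / sigma
    return 0.5 * (1.0 + erf(t / sqrt(2.0)))


# Areas under the normalized curve within 1, 2, and 3 standard deviations.
areas = {k: normal_cdf(k) - normal_cdf(-k) for k in (1, 2, 3)}
for k, area in areas.items():
    print(f"-{k} < t < +{k}: {area:.4f}")

# The change of variables makes any (mu, sigma) pair reduce to the same table:
# 105 with mu = 100 and sigma = 5 is one standard deviation above the mean.
shifted = normal_cdf(105.0, mu=100.0, sigma=5.0)
```

The three areas come out to about 0.6827, 0.9545, and 0.9973, the familiar one-, two-, and three-sigma probabilities.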
Figure A6 Area under the normal curve: (a) for $-1 < t < 1$, $-2 < t < 2$, and
$-3 < t < 3$; and (b) between $-\infty$ and various values of $t$.


A7 MOMENTS 

The density or distribution function of a random variable contains all the
information about the variable, i.e., the entire story. Sometimes the entire story of
the random variable is not necessary, and an excerpt which characterizes the
distribution is sufficient. In such a case one computes a few
moments (generally two) for the distribution and uses them to delineate the
salient features. The moments are weighted integrals of the density function
which describe various geometrical properties of the density function.

A7.1 Expected Value 

It is easy to express the various moments of a probability distribution in terms
of an operator called the expected value. The expected value of the continuous
random variable $x$ defined over the range $a < x \le b$ with density function $f(x)$
is

$$E(x) = \int_a^b x f(x)\,dx \tag{A33}$$

For a discrete random variable $x$ taking on values $x = x_1, x_2, \ldots, x_n$ the expected
value is defined in terms of a summation

$$E(x) = \sum_{i=1}^{n} x_i f(x_i) \tag{A34}$$

A7.2 Moments 

To be more general one defines an entire set of moments. The $r$th moment of
the random variable $x$ computed about the origin and defined over the range
$a < x \le b$ is given by

$$m_r = \int_a^b x^r f(x)\,dx \tag{A35}$$

The zero-order moment $m_0$ is the area under the density function, which is, of
course, unity. The first-order moment is simply the expected value, which is
called the mean and is given the symbol $\mu$:

$$m_1 = E(x) = \mu \tag{A36}$$

The origin moments for a discrete random variable which takes on the values
$x_1, x_2, \ldots, x_n$ are given by

$$m_r = \sum_{i=1}^{n} x_i^r f(x_i) \tag{A37}$$

It is often of importance to compute moments about the mean rather than the
origin. The set of moments about the mean are defined as follows:

For continuous random variables:

$$m'_r = E[(x - \mu)^r] = \int_{-\infty}^{+\infty} (x - \mu)^r f(x)\,dx \tag{A38}$$
For discrete random variables:

$$m'_r = E[(x - \mu)^r] = \sum_{i=1}^{n} (x_i - \mu)^r f(x_i) \tag{A39}$$

TABLE A1 Mean and Variance for Several Distributions

Distribution    E(x)                              var x
Binomial        np                                np(1 - p)
Poisson         μ                                 μ
Exponential     1/λ                               1/λ²
Rayleigh        √(π/(2K))                         0.4292/K
Weibull         (K/(m+1))^(-1/(m+1)) Γ(ε)         (K/(m+1))^(-2/(m+1)) Γ(δ) - [E(x)]²
                ε = (m + 2)/(m + 1)               δ = (m + 3)/(m + 1)
Normal          μ                                 σ²

Γ = the gamma function

The second moment about the mean, $m'_2 = \int_{-\infty}^{+\infty} (x - \mu)^2 f(x)\,dx$, is called the
variance of $x$, var $x$, and is a measure of the sum of the squares of the
deviations from $\mu$. Generally this is expressed in terms of the standard deviation
$\sigma = \sqrt{\operatorname{var} x}$. One can easily express var $x$ and $\sigma$ in terms of the expected-value
operator:

$$\sigma^2 = \operatorname{var} x = E(x^2) - \mu^2$$

The means and variances of the distributions discussed in Sections A5 and A6
are given in Table A1.
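The moment definitions can be exercised on the binomial distribution, whose mean np and variance np(1 - p) are well known. This sketch evaluates the discrete sums directly; the choice n = 9, p = 0.2 mirrors one panel of Fig. A3, and the function names are illustrative.

```python
from math import comb, isclose


def binomial(r, n, p):
    """Binomial density: r successes in n independent trials."""
    return comb(n, r) * p**r * (1 - p) ** (n - r)


def origin_moment(values, density, r):
    """r-th moment about the origin for a discrete variable."""
    return sum(x**r * density(x) for x in values)


def central_moment(values, density, r, mean):
    """r-th moment about the mean for a discrete variable."""
    return sum((x - mean) ** r * density(x) for x in values)


n, p = 9, 0.2
values = range(n + 1)


def f(x):
    return binomial(x, n, p)


m0 = origin_moment(values, f, 0)                    # area under the density
mean = origin_moment(values, f, 1)                  # expected value
variance = central_moment(values, f, 2, mean)       # second moment about the mean
shortcut = origin_moment(values, f, 2) - mean**2    # E(x^2) - mu^2

print(m0, mean, variance, shortcut)
```

The second central moment and the shortcut E(x²) - μ² agree, and both match np(1 - p) = 1.44 for these parameters.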


A8 MARKOV MODELS 
A8.1 Properties 

There are basically four kinds of Markov probability models, one of which 
plays a central role in reliability. Markov models are functions of two random 
variables: the state of the system x and the time of observation t. The four 
kinds of models arise because both x and t may be either discrete or continuous 
random variables, resulting in four combinations. As a simple example of the 
concepts of state and time of observation, we visualize a shoe box with two 
interior partitions which divide the box into three interior compartments labeled 
1, 2, and 3. A Ping-Pong ball is placed into one of these compartments, and the 
box is periodically tapped on the bottom, causing the Ping-Pong ball to jump
up and fall back into one of the three compartments. (For the moment we 
neglect the possibility that it falls out onto the floor.) The states of the system 
are the three compartments in the box. The time of observation is immediately 
after each rap, when the ball has fallen back into one of the compartments. 
Since we specified that the raps occur periodically, the model is discrete in both 
state and time. This sort of model is generally called a Markov chain model or 
a discrete-state discrete-time model. When the raps at the bottom of the box 
occur continuously, the model becomes a discrete-state continuous-time model, 
called a Markov process. If we remove the partitions and call the long axis of 
the box the x axis, we can visualize a continuum of states from $x = -l/2$
to $x = +l/2$, where $l$ is the length of the box. If the ball is coated with rubber cement, it will stick wherever
it hits when it falls back into the box. In this manner we can visualize the 
other two types of models, which involve a continuous-state variable. We shall 
be concerned only with the discrete-state continuous-time model, the Markov 
process. 

Any Markov model is defined by a set of probabilities $p_{ij}$ which define the
probability of transition from any state $i$ to any state $j$. If in the discrete-state
case we make our box compartments and the partitions equal in size, all the
transition probabilities should be equal. (In the general case, where each
compartment is of different size, the transition probabilities are unequal.) One of
the most important features of any Markov model is that the transition
probability $p_{ij}$ depends only on states $i$ and $j$ and is completely independent of all
past states except the last one, state $i$. This seems reasonable in terms of our
shoe-box model since transitions are really dependent only on the height of the
wall between adjacent compartments $i$ and $j$ and the area of the compartments
and not on the sequence of states the ball has occupied before arriving in state
$i$. Before delving further into the properties of Markov processes, an example
of great importance, the Poisson process, will be discussed.

A8.2 Poisson Process 

In Section A5.4 the Poisson distribution was introduced as a limiting form
of the binomial distribution. In this section we shall derive the Poisson
distribution as the governing probability law for a Poisson process, a particular
kind of Markov process. In a Poisson process we are interested in the number
of occurrences in time, the probability of each occurrence in a small time $\Delta t$
being a constant which is the parameter of the process. Examples of Poisson
processes are the number of atoms transmuting as a function of time in the
radioactive decay of a substance, the number of noise pulses as a function of
time in certain types of electric systems, and the number of failures for a group
of components operating in a standby mode or in an instantaneous-replacement
situation. The occurrences are discrete, and time is continuous; therefore this
is a discrete-state continuous-time model. The basic assumptions which are
necessary in deriving a Poisson process model are as follows:
1. The probability that a transition occurs from the state of $n$ occurrences
to the state of $n + 1$ occurrences in time $\Delta t$ is $\lambda \Delta t$. The parameter $\lambda$
is a constant and has the dimensions of occurrences per unit time. The
occurrences are irreversible, which means that the number of occurrences
can never decrease with time.

2. Each occurrence is independent of all other occurrences.

3. The transition probability of two or more occurrences in interval $\Delta t$
is negligible. Another way of saying this is to make use of the
independence-of-occurrence property and write the probability of two
occurrences in interval $\Delta t$ as the product of the probability of each
occurrence, that is, $(\lambda \Delta t)(\lambda \Delta t)$. This is obviously an infinitesimal of second
order for $\Delta t$ small and can be neglected.

We wish to solve for the probability of $n$ occurrences in time $t$, and to that
end we set up a system of difference equations representing the state
probabilities and transition probabilities. The probability of $n$ occurrences having
taken place by time $t$ is denoted by

$$P(x = n, t) = P_n(t)$$

For the case of zero occurrences at time $t + \Delta t$ we write the following
difference equation:

$$P_0(t + \Delta t) = (1 - \lambda \Delta t)\,P_0(t) \tag{A40}$$

which says that the probability of zero occurrences at time $t + \Delta t$, $P_0(t + \Delta t)$,
is given by the probability of zero occurrences at time $t$, $P_0(t)$,
multiplied by the probability of no occurrences in interval $\Delta t$, $1 - \lambda \Delta t$. For the
case of one occurrence at time $t + \Delta t$ we write

$$P_1(t + \Delta t) = (\lambda \Delta t)\,P_0(t) + (1 - \lambda \Delta t)\,P_1(t) \tag{A41}$$

The probability of one occurrence at $t + \Delta t$, $P_1(t + \Delta t)$, can arise in two ways:
(1) either there was no occurrence at time $t$, $P_0(t)$, and one happened in the
interval $\Delta t$ (with probability $\lambda \Delta t$), or (2) there had already been one occurrence
at time $t$, $P_1(t)$, and no additional ones came along in the time interval $\Delta t$
(probability $1 - \lambda \Delta t$). It is clear that Eq. (A41) can be generalized, yielding

$$P_n(t + \Delta t) = (\lambda \Delta t)\,P_{n-1}(t) + (1 - \lambda \Delta t)\,P_n(t) \quad \text{for } n = 1, 2, \ldots \tag{A42}$$

The difference equations (A40) and (A42) really describe a discrete-time
system, since time is divided into intervals $\Delta t$, but by taking limits as $\Delta t \to 0$
we obtain a set of differential equations which truly describe the
continuous-time Poisson process. Rearranging Eq. (A40) and taking the limit of both sides
of the equation as $\Delta t \to 0$ leads to

$$\lim_{\Delta t \to 0} \frac{P_0(t + \Delta t) - P_0(t)}{\Delta t} = \lim_{\Delta t \to 0} \left[-\lambda P_0(t)\right]$$

By definition the left-hand side of the equation is the time derivative of $P_0(t)$
and the right-hand side is independent of $\Delta t$; therefore

$$\frac{dP_0(t)}{dt} = \dot{P}_0(t) = -\lambda P_0(t) \tag{A43}$$


Similarly for Eq. (A42)

$$\lim_{\Delta t \to 0} \frac{P_n(t + \Delta t) - P_n(t)}{\Delta t} = \lim_{\Delta t \to 0} \lambda P_{n-1}(t) - \lim_{\Delta t \to 0} \lambda P_n(t)$$

$$\frac{dP_n(t)}{dt} = \dot{P}_n(t) = \lambda P_{n-1}(t) - \lambda P_n(t) \quad \text{for } n = 1, 2, \ldots \tag{A44}$$

Equations (A43) and (A44) are a complete set of differential equations which,
together with a set of initial conditions, describe the process. If there are no
occurrences at the start of the problem, $t = 0$, $n = 0$, and

$$P_0(0) = 1, \quad P_1(0) = P_2(0) = \cdots = P_n(0) = 0$$


Solution of this set of equations can be performed in several ways:
classical differential-equation techniques, Laplace transforms, matrix methods, etc.
In this section we shall solve them using the classical technique of
undetermined coefficients. Substituting a solution of the form $Ae^{st}$ into Eq. (A43)
gives $s = -\lambda$, and substituting the initial condition $P_0(0) = 1$ gives

$$P_0(t) = e^{-\lambda t} \tag{A45}$$

For $n = 1$, Eq. (A44) becomes

$$\dot{P}_1(t) = \lambda P_0(t) - \lambda P_1(t)$$

Substitution from Eq. (A45) and rearrangement yields

$$\dot{P}_1(t) + \lambda P_1(t) = \lambda e^{-\lambda t}$$

The homogeneous portion of this equation is the same as that for $P_0(t)$, so the
homogeneous solution is $Ae^{-\lambda t}$. The particular solution is of the form
$Bte^{-\lambda t}$. Substituting yields $B = \lambda$, and using the initial condition
$P_1(0) = 0$ gives $A = 0$, so that

$$P_1(t) = \lambda t e^{-\lambda t} \tag{A46}$$
TABLE A2 A Transition Matrix

                                      Final States
Initial States      s_0(t + Δt)   s_1(t + Δt)   s_2(t + Δt)   ...   s_n(t + Δt)
s_0(t)   n = 0      p_00          p_01          p_02          ...   p_0n
s_1(t)   n = 1      p_10          p_11          p_12          ...   p_1n
s_2(t)   n = 2      p_20          p_21          p_22          ...   p_2n
...
s_n(t)   n = n      p_n0          p_n1          p_n2          ...   p_nn


It should be clear that solving for $P_n(t)$ for $n = 2, 3, \ldots$ will generate the Poisson
probability law given in Eq. (A23). (Note: $n$ here corresponds to $r$ in Eq. (A23).)
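The claim that the state equations generate the Poisson law can be checked numerically by stepping the difference equations forward with a small time increment. This is a sketch under assumed values (a rate of 2 per unit time, observation time 1.5, and 20,000 time steps); it is not the analytical solution method used in the text.

```python
from math import exp, factorial


def poisson_state_probabilities(lam, t, n_max, steps=20_000):
    """Step the difference equations
        P_0(t + dt) = (1 - lam dt) P_0(t)
        P_n(t + dt) = (lam dt) P_{n-1}(t) + (1 - lam dt) P_n(t)
    from time 0 to time t, starting with no occurrences."""
    dt = t / steps
    probs = [1.0] + [0.0] * n_max        # P_0(0) = 1, all other P_n(0) = 0
    for _ in range(steps):
        updated = [(1 - lam * dt) * probs[0]]
        for n in range(1, n_max + 1):
            updated.append(lam * dt * probs[n - 1] + (1 - lam * dt) * probs[n])
        probs = updated
    return probs


lam, t = 2.0, 1.5  # hypothetical rate and observation time
numeric = poisson_state_probabilities(lam, t, n_max=5)
exact = [(lam * t) ** n * exp(-lam * t) / factorial(n) for n in range(6)]

for n, (approx, target) in enumerate(zip(numeric, exact)):
    print(n, round(approx, 5), round(target, 5))
```

With a finite step the recursion reproduces a binomial distribution with one trial per step, so the agreement with the Poisson law improves as the step size shrinks, in line with the limiting argument of Section A5.4.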

Thus, the Poisson process has been shown to be a special type of Markov
process which can be derived from the three basic postulates with no mention
of the binomial distribution. We can give another important interpretation to
$P_0(t)$. If we let $t_0$ be the time of the first occurrence, then $P_0(t)$ is the probability
of no occurrences up to time $t$:

$$P_0(t) = P(t < t_0) = 1 - P(t_0 < t)$$

Thus, $1 - P_0(t)$ is a cumulative distribution function for the random variable
$t_0$, the time of first occurrence. The density function for time of first occurrence is
obtained by differentiation:

$$f(t) = \frac{d}{dt}\left[1 - P_0(t)\right] = \lambda e^{-\lambda t} \tag{A47}$$

This means that the time of first occurrence is exponentially distributed. Since
each occurrence is independent of all others, it also means that the time
between any two occurrences is exponentially distributed.

A8.3 Transition Matrix 

Returning to some of the basic properties of Markov processes, we find that
we can specify the process by a set of differential equations and their
associated initial conditions. Because of the basic Markov assumption that only the
last state is involved in determining the probabilities, we always obtain a set
of first-order differential equations. The constants in these equations can be
specified by constructing a transition-probability matrix.³ The rows represent
the probability of being in any state A at time $t$ and the columns the
probability of being in state B at time $t + \Delta t$. The former are called initial states
and the latter final states. An example is given in Table A2 for a process with
$n + 1$ discrete states. The transition probability $p_{ij}$ is the probability that in time
$\Delta t$ the system will undergo a transition from initial state $i$ to final state $j$. Of
course a term $p_{ii}$ on the main diagonal is the probability that the system will
remain in the same state during one transition. The sum of the $p_{ij}$ terms in any
row must be unity, since this is the sum of all possible transition probabilities.
In the case of a Poisson process, there is an infinite number of states. The
transition matrix for the first five states of a Poisson process is given in Table
A3.

³ In Appendix B6-B8 a flowgraph model for a Markov process will be developed which parallels
the use of the transition matrix. The flowgraph model is popular in engineering analysis.

TABLE A3 The First Five Rows and Columns of the Transition Matrix for a
Poisson Process

            s_0(t + Δt)   s_1(t + Δt)   s_2(t + Δt)   s_3(t + Δt)   s_4(t + Δt)
s_0(t)      1 - λΔt       λΔt           0             0             0
s_1(t)      0             1 - λΔt       λΔt           0             0
s_2(t)      0             0             1 - λΔt       λΔt           0
s_3(t)      0             0             0             1 - λΔt       λΔt
s_4(t)      0             0             0             0             1 - λΔt

Inspection of the Poisson example reveals that the difference equations⁴ for
the system can be obtained simply. The procedure is to equate the probability
of any final state at the top of each column to the sum of the products of the
transition probabilities in that column and the corresponding initial-state
probabilities in each row. Specifically, for the transition matrix given in Table A2,

$$P_{s_0}(t + \Delta t) = p_{00} P_{s_0}(t) + p_{10} P_{s_1}(t) + \cdots + p_{n0} P_{s_n}(t)$$


If the $p_{ij}$ terms are all independent of time and depend only on constants
and $\Delta t$, the process is called homogeneous. For a homogeneous process, the
resulting differential equations have constant coefficients, and the solutions are
of the form $e^{-rt}$ or $t^n e^{-rt}$. If for a homogeneous process the final value of
the probability of being in any state is independent of the initial conditions,
the process is called ergodic. A finite-state homogeneous process is ergodic
if every state can be reached from any other state with positive probability.
Whenever it is not possible to reach any other state from some particular state, 
the latter state is called an absorbing state. Returning to the partitioned shoe- 
box example of Section A8.1, if we allow the ball to hop completely out of 
the box onto the floor, the floor forms a fourth state, which is absorbing. In a 
transition matrix any row j having only a single nonzero entry ($p_{jj} = 1$ on the
main diagonal) corresponds to an absorbing state.


4 The differential equations are obtained by taking the limit of the difference equations as $\Delta t \to 0$.


PROBLEMS

Note: Problems A1-A5 are taken from Shooman [1990]. 

A1. The following two theorems are known as De Morgan’s theorems:

$$\overline{A + B + C} = \bar{A}\,\bar{B}\,\bar{C}$$
$$\overline{ABC} = \bar{A} + \bar{B} + \bar{C}$$

Prove these two theorems using a Venn diagram. Do these theorems hold 
for more than three events? Explain. 

A2. We wish to compute the probability of winning on the first roll of a pair 
of dice by throwing a seven or an eleven. 

(a) Define a sample space for the sum of the two dice. 

(b) Delineate the favorable and unfavorable outcomes. 

(c) Compute the probability of winning and losing. 

(d) List any assumptions you made in this problem.

A3. Suppose a resistor has a resistance R with mean of 100 Ω and a tolerance
of 5%, i.e., variation of 5 Ω.

(a) If the resistance values are normally distributed with $\mu$ = 100 Ω and
$\sigma$ = 5 Ω, sketch f(R).

(b) Assume that the resistance values have a Rayleigh distribution. If the
peak is to occur at 100 Ω, what is the value of K? Plot the Rayleigh
distribution on the same graph as the normal distribution of part (a).

A4. A certain resistor has a nominal value (mean) of 100 Ω.

(a) Assume a normal distribution and compute the value of $\sigma$ if we wish
P(95 < R < 105) = 0.95.

(b) Repeat part (a) assuming a Weibull distribution and specify the values 
of K and m. 

(c) Plot the density function for parts (a) and (b) on the same graph paper. 

A5. Let a component have a good, a fair, and a bad state. Assume the transition
probabilities of failure are from good to fair, $\lambda_{gf}\Delta t$, from good to bad,
$\lambda_{gb}\Delta t$, and from fair to bad, $\lambda_{fb}\Delta t$.

(a) Formulate a Markov model. 

(b) Compute the probabilities of being in any state. 



APPENDIX B 


SUMMARY OF RELIABILITY 
THEORY* 


B1 INTRODUCTION 
B1.1 History

Since its beginnings following World War II, reliability theory has grown into 
an engineering science in its own right. (The early development is discussed 
in Chapter 1 of Shooman [1990].) Much of the initial theory, engineering, and 
management techniques centered about hardware; however, human and proce¬ 
dural elements of a system were often included. Since the late 1960s the term 
software reliability has become popular, and now reliability theory refers to 
both software and hardware reliability. 


B1.2 Summary of the Approach 

The conventional approach to reliability is to decompose the system into 
smaller subsystems and units. Then by the use of combinatorial reliability, 
the system probability of success is expressed in terms of the probabilities 
of success of the elements. Then by the use of failure rate models, the element 
probabilities of success are computed. These two concepts are combined to 
calculate the system reliability. 

When reliability or availability of repairable systems is the appropriate figure
of merit, Markov models are generally used to compute the associated
probabilities.

Often a proposed system does not meet its reliability specifications, and various
techniques of reliability improvement are utilized to improve the predicted
reliability of the design.

Readers desiring more detail are referred to Shooman [1990] and the references
cited in that text.


*Parts of this appendix have been abstracted from Appendix B of Software Engineering: Design,
Reliability, and Management, by M. L. Shooman, McGraw-Hill, New York, 1983; and also
Probabilistic Reliability: An Engineering Approach, 2d ed., by M. L. Shooman, Krieger, 1990.


B1.3 Purpose of This Appendix 

This appendix was written to serve several purposes. The prime reason is to 
provide additional background for those techniques and principles of reliability
theory which are used in the software reliability models developed in Chap¬ 
ter 5. A second purpose is to expose software engineers who are not familiar 
with reliability theory to some of the main methods and techniques. This is 
especially important since many discussions of software reliability end up dis¬ 
cussing how much of “hardware reliability theory” is applicable to software. 
This author feels the correct answer is “some”; however, the only way to really 
appreciate this answer is to learn something about reliability. 

The third purpose is to allow readers who are software engineers to talk 
with and understand hardware reliability engineers. If a reliability and qual¬ 
ity control (R&QC) engineer handles the hardware reliability estimates and 
the software engineer generates software reliability estimates, they must meet 
at the interface. Even if the R&QC engineer computes reliability estimates 
for both the hardware and the software, it is still necessary for the software 
engineer to work with him or her and provide information as well as roughly 
evaluate the thoroughness and quality of the software effort. 


B2 COMBINATORIAL RELIABILITY 
B2.1 Introduction 

In performing the reliability analysis of a complex system, it is almost impos¬ 
sible to treat the system in its entirety. The logical approach is to decompose 
the system into functional entities composed of units, subsystems, or compo¬ 
nents. Each entity is assumed to have two states, one good and one bad. The 
subdivision generates a block-diagram or fault-tree description of system oper¬ 
ation. Models are then formulated to fit this logical structure, and the calculus 
of probability is used to compute the system reliability in terms of the subdivi¬ 
sion reliabilities. Series and parallel structures often occur, and their reliability 
can be described very simply. In many cases the structure is of a more com¬ 
plicated nature, and more general techniques are needed. 

The formulation of a structural-reliability model can be difficult in a large,
sophisticated system and requires much approximation and judgment. This is
best done by a system engineer or someone closely associated with one who
knows the system operation thoroughly.


B2.2 Series Configuration 

The simplest and perhaps most common structure in reliability analysis is the 
series configuration. In the series case the functional operation of the system 
depends on the proper operation of all system components. A series string 
of Christmas tree lights is an obvious example. The word functional must be 
stressed, since the electrical or mechanical configuration of the circuit may 
differ from the logical structure. 

A series reliability configuration will be portrayed by the block-diagram
representation shown in Fig. B1(a), or the reliability graph shown in Fig. B1(b).
In either case, a single path from cause to effect is created. Failure of any
component is represented by removal of the component, which interrupts the
path and thereby causes the system to fail.

The system shown in Fig. B1 is divided into n series-connected units. This
system can represent n parts in an electronic amplifier, the n subsystems in an
aircraft autopilot, or the n operations necessary to place a satellite in orbit. The
event signifying the success of the nth unit will be $x_n$, and $\bar{x}_n$ will represent the
failure of the nth unit. The probability that unit n is successful will be $P(x_n)$,
and the probability that unit n fails will be $P(\bar{x}_n)$. The probability of system
success is denoted by $P_s$. In keeping with the definition of reliability, $P_s = R$,
where R stands for the system reliability. The probability of system failure is

$$P_f = 1 - P_s$$

Since the series configuration requires that all units operate successfully for
system success, the event representing system success is the intersection of $x_1$,
$x_2$, ..., $x_n$. The probability of this event is given by


$$R = P_s = P(x_1 x_2 x_3 \cdots x_n) \tag{B1}$$

[Figure B1(a): block diagram, Cause -> Unit 1 -> Unit 2 -> ... -> Unit n -> Effect.
Figure B1(b): reliability graph with Unit 1, Unit 2, ..., Unit n as series branches
from Cause to Effect.]

Figure B1 Series reliability configuration: (a) reliability block diagram (RBD); (b)
reliability graph.

Expansion of Eq. (B1) yields

$$P_s = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1 x_2) \cdots P(x_n \mid x_1 x_2 \cdots x_{n-1}) \tag{B2}$$

The expression appearing in Eq. (B2) contains conditional probabilities,
which must be evaluated with care. For example, $P(x_3 \mid x_1 x_2)$ is the probability
of success of unit 3 evaluated under the condition that units 1 and 2 are
operating. In the case where the power dissipation from units 1 and 2 affects
the temperature of unit 3 and thereby its failure rate, a conditional probability
is involved. If the units do not interact, the failures are independent, and Eq.
(B2) simplifies to

$$P_s = P(x_1)\,P(x_2)\,P(x_3) \cdots P(x_n) \tag{B3}$$

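The reliability of a series system is never larger than the smallest reliability
of the set of components. The lowest reliability in the series is often
referred to as “the weakest link in the chain.” An alternative approach is to
compute the probability of failure. The system fails if any of the units fail,
and therefore we have a union of events

$$P_f = P(\bar{x}_1 + \bar{x}_2 + \bar{x}_3 + \cdots + \bar{x}_n) \tag{B4}$$

Expansion of Eq. (B4) yields

$$\begin{aligned}
P_f = {} & [P(\bar{x}_1) + P(\bar{x}_2) + P(\bar{x}_3) + \cdots + P(\bar{x}_n)] \\
& - [P(\bar{x}_1 \bar{x}_2) + P(\bar{x}_1 \bar{x}_3) + \cdots + \underbrace{P(\bar{x}_i \bar{x}_j)}_{i \ne j}] \\
& + \cdots + (-1)^{n-1} P(\bar{x}_1 \bar{x}_2 \cdots \bar{x}_n)
\end{aligned} \tag{B5}$$

Since

$$P_s = 1 - P_f \tag{B6}$$

the probability of system success becomes

$$\begin{aligned}
P_s = {} & 1 - P(\bar{x}_1) - P(\bar{x}_2) - P(\bar{x}_3) - \cdots - P(\bar{x}_n) \\
& + P(\bar{x}_1)P(\bar{x}_2 \mid \bar{x}_1) + P(\bar{x}_1)P(\bar{x}_3 \mid \bar{x}_1) + \cdots + \underbrace{P(\bar{x}_i)P(\bar{x}_j \mid \bar{x}_i)}_{i \ne j} \\
& - \cdots + (-1)^{n} P(\bar{x}_1)P(\bar{x}_2 \mid \bar{x}_1) \cdots P(\bar{x}_n \mid \bar{x}_1 \cdots \bar{x}_{n-1})
\end{aligned} \tag{B7}$$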


The reliability expression in Eq. (B7) is equivalent to that in Eq. (B2) but is
much more difficult to evaluate because of the many terms involved. Equation
(B7) also involves conditional probabilities; for example, $P(\bar{x}_3 \mid \bar{x}_1 \bar{x}_2)$ is the
probability that unit 3 will fail given the fact that units 1 and 2 have failed. In
the case of independence $P(\bar{x}_3 \mid \bar{x}_1 \bar{x}_2)$ becomes $P(\bar{x}_3)$, and the other conditional
probability terms in Eq. (B7) simplify, yielding

$$\begin{aligned}
P_s = {} & 1 - P(\bar{x}_1) - P(\bar{x}_2) - P(\bar{x}_3) - \cdots - P(\bar{x}_n) \\
& + P(\bar{x}_1)P(\bar{x}_2) + P(\bar{x}_1)P(\bar{x}_3) + \cdots + \underbrace{P(\bar{x}_i)P(\bar{x}_j)}_{i \ne j} \\
& - \cdots + (-1)^{n} P(\bar{x}_1)P(\bar{x}_2) \cdots P(\bar{x}_n)
\end{aligned} \tag{B8}$$

Equation (B8) is still more complex than Eq. (B3). It is interesting to note that 
the reliability of any particular configuration may be computed by considering 
either the probability of success or the probability of failure. In a very complex 
structure both approaches may be used at different stages of the computation. 
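
As a numerical check that the success-based and failure-based formulations agree, the sketch below evaluates Eq. (B3) directly and Eq. (B8) by inclusion-exclusion for independent units. The function names and the three unit reliabilities are illustrative assumptions.

```python
from itertools import combinations
from math import prod

def series_reliability(p):
    """Eq. (B3): product of the unit success probabilities (independent units)."""
    return prod(p)

def series_reliability_via_failures(p):
    """Eq. (B8): 1 minus the inclusion-exclusion expansion of the union of unit failures."""
    q = [1.0 - p_i for p_i in p]           # unit failure probabilities
    total = 1.0
    for k in range(1, len(q) + 1):
        sign = (-1) ** k                   # -singles, +pairs, -triples, ...
        total += sign * sum(prod(c) for c in combinations(q, k))
    return total

units = [0.99, 0.95, 0.90]                 # assumed unit reliabilities
print(series_reliability(units))
print(series_reliability_via_failures(units))
```

Both calls should print the same value up to floating-point rounding. The second function sums 2^n - 1 terms, which illustrates why Eq. (B3) is the practical choice for a series structure.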


B2.3 Parallel Configuration 

In many systems several signal paths perform the same operation. If the system 
configuration is such that failure of one or more paths still allows the remaining 
path or paths to perform properly, the system can be represented by a parallel 
model. 

A block diagram and reliability graph for a parallel system are shown in Fig. 
B2. There are n paths connecting input to output, and all units must fail in order 
to interrupt all the paths. This is sometimes called a redundant configuration. 

In a parallel configuration the system is successful if any one of the parallel 
channels is successful. The probability of success is given by the probability 
of the union of the n successful events. 



$$P_s = P(x_1 + x_2 + x_3 + \cdots + x_n) \tag{B9}$$

Figure B2 Parallel reliability configuration: (a) reliability block diagram; (b)
reliability graph.

Expansion of Eq. (B9) yields

$$\begin{aligned}
P_s = {} & [P(x_1) + P(x_2) + P(x_3) + \cdots + P(x_n)] \\
& - [P(x_1 x_2) + P(x_1 x_3) + \cdots + \underbrace{P(x_i x_j)}_{i \ne j}] \\
& + \cdots + (-1)^{n-1} P(x_1 x_2 \cdots x_n)
\end{aligned} \tag{B10}$$

The conditional probabilities which occur in Eq. (B10) when the intersection
terms are expanded must be interpreted properly, as in the previous section
[see Eq. (B7)]. A simpler formula can be developed in the parallel case if one
deals with the probability of system failure. System failure occurs if all the
system units fail, yielding the probability of their intersection.

$$P_f = P(\bar{x}_1 \bar{x}_2 \bar{x}_3 \cdots \bar{x}_n) \tag{B11}$$

where

$$P_s = 1 - P_f \tag{B12}$$

Substitution of Eq. (B11) into Eq. (B12) and expansion yields

$$P_s = 1 - P(\bar{x}_1)\,P(\bar{x}_2 \mid \bar{x}_1)\,P(\bar{x}_3 \mid \bar{x}_1 \bar{x}_2) \cdots P(\bar{x}_n \mid \bar{x}_1 \bar{x}_2 \cdots \bar{x}_{n-1}) \tag{B13}$$

If the unit failures are independent, Eq. (B13) simplifies to

$$P_s = 1 - P(\bar{x}_1)\,P(\bar{x}_2) \cdots P(\bar{x}_n) \tag{B14}$$
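
A minimal sketch of Eq. (B14) for independent units follows; the function name and the example reliabilities are assumptions chosen only for illustration.

```python
from math import prod

def parallel_reliability(p):
    """Eq. (B14): 1 minus the product of the unit failure probabilities (independent units)."""
    return 1.0 - prod(1.0 - p_i for p_i in p)

channels = [0.90, 0.90]                    # assumed reliabilities of two parallel channels
print(parallel_reliability(channels))      # system succeeds unless both channels fail

# Adding a redundant channel can only raise (never lower) the system reliability.
print(parallel_reliability(channels + [0.80]))
```

In contrast to the series case, the parallel result is never smaller than the reliability of the best single unit.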


B2.4 An r-out-of-n Configuration

In many problems the system operates if r out of n units function, e.g., a bridge
supported by n cables, r of which are necessary to support the maximum load.
If each of the n units is identical, the probability of exactly r successes out
of n units is given by Eq. (A21)

$$B(r; n, p) = \binom{n}{r} p^r (1 - p)^{n-r} \qquad \text{for } r = 0, 1, 2, \ldots, n \tag{B15}$$

where p is the probability of success of any unit. The system will succeed if r,
r + 1, ..., n - 1, or n units succeed. The probability of system success is given
by

$$P_s = \sum_{k=r}^{n} \binom{n}{k} p^k (1 - p)^{n-k} \tag{B16}$$

Figure B3 Reliability graph for a 4-out-of-5 system.

If the units all differ, Eqs. (B15) and (B16) no longer hold, and one is faced
with the explicit enumeration of all possible successful combinations. One can
draw a reliability graph as an aid. The graph will have $\binom{n}{r}$ parallel paths. Each
parallel path will contain r different elements, corresponding to one of the
combinations of n things r at a time. Such a graph for a four-out-of-five system is
given in Fig. B3. The system succeeds if any path succeeds. Each path success
depends on the success of four elements:

$$P_s = P(x_1 x_2 x_3 x_4 + x_1 x_2 x_3 x_5 + x_1 x_2 x_4 x_5 + x_1 x_3 x_4 x_5 + x_2 x_3 x_4 x_5) \tag{B17}$$

Expanding Eq. (B17) involves simplification of redundant terms. For example,
the term $P[(x_1 x_2 x_3 x_4)(x_1 x_2 x_3 x_5)]$ becomes by definition $P(x_1 x_2 x_3 x_4 x_5)$. Thus,
the equation simplifies to

$$\begin{aligned}
P_s = {} & P(x_1 x_2 x_3 x_4) + P(x_1 x_2 x_3 x_5) + P(x_1 x_2 x_4 x_5) + P(x_1 x_3 x_4 x_5) \\
& + P(x_2 x_3 x_4 x_5) - 4P(x_1 x_2 x_3 x_4 x_5)
\end{aligned} \tag{B18}$$

It is easy to check Eq. (B18). For independent, identical elements Eq. (B18)
gives $P_s = 5p^4 - 4p^5$. From Eq. (B16) we obtain

$$P_s = \sum_{k=4}^{5} \binom{5}{k} p^k (1 - p)^{5-k} = \binom{5}{4} p^4 (1 - p)^1 + \binom{5}{5} p^5 (1 - p)^0 = 5p^4 - 4p^5$$
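
The check above can also be carried out numerically. The sketch below evaluates Eq. (B16) for identical units and, for units that differ, sums the probabilities of every mutually exclusive system state in which at least r units work. That state enumeration is a substitute for expanding the union of paths in Eq. (B17), not the procedure described in the text; it assumes independent units, and the function names and numerical values are illustrative.

```python
from itertools import combinations
from math import comb, prod

def r_out_of_n_identical(r, n, p):
    """Eq. (B16): probability that at least r of n identical, independent units succeed."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r, n + 1))

def r_out_of_n_general(r, p):
    """Independent, non-identical units: sum over all states with at least r working units."""
    n = len(p)
    total = 0.0
    for k in range(r, n + 1):
        for working in combinations(range(n), k):
            w = set(working)
            # Probability of exactly this set working and all other units failed.
            total += prod(p[i] if i in w else 1.0 - p[i] for i in range(n))
    return total

p = 0.9                                            # assumed unit reliability
print(r_out_of_n_identical(4, 5, p))               # Eq. (B16)
print(5 * p**4 - 4 * p**5)                         # closed form from Eq. (B18)
print(r_out_of_n_general(4, [0.9, 0.9, 0.9, 0.9, 0.9]))
print(r_out_of_n_general(4, [0.99, 0.95, 0.9, 0.9, 0.85]))  # units that differ
```

The first three printed values should agree up to rounding, which mirrors the hand check of Eq. (B18) against Eq. (B16).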
B2.5 Fault-Tree Analysis

Fault-tree analysis (FTA) is an application of deductive logic to produce a 
failure- or fault-oriented pictorial diagram, which allows one to analyze system 
safety and reliability. Various failure modes that can contribute to a specified 
undesirable event are organized deductively and represented pictorially. 

First the top undesired event is defined and drawn. Below this, secondary 
undesired events are drawn. These secondary undesired events include the 
potential hazards and failures that are immediate causes of the top event. 
Below each of these subevents are drawn second-level events, which are the 
immediate causes of the subevents. The process is continued until basic events 
are reached (often called elementary faults). Since the diagram branches out 
and there are more events at each lower level, it resembles an inverted tree. 
The treelike structure of the diagram illustrates the various critical paths of 
subevents leading to the occurrence of the top undesired event. A fault tree 
for an auto braking system example is given in Section B5, Fig. B13. 

Both FTAs and RBDs are useful for both qualitative and quantitative anal¬ 
yses: 

1. They force the analyst to actively seek out failure events (success events) 
in a deductive manner. 

2. They provide a visual display of how a system can fail, and thus aid 
understanding of the system by persons other than the designer. 

3. They point out critical aspects of systems failure (system success). 

4. They provide a systematic basis for quantitative analysis of reliability. 

Often in a difficult practical problem one utilizes other techniques to decom¬ 
pose the system prior to effecting either an RBD or an FTA. 


B2.6 Failure Mode and Effect Analysis 

Failure mode and effect analysis (FMEA) is a systematic procedure for iden¬ 
tifying the modes of failures and for evaluating their consequences. It is a 
tabular procedure which considers hazards in terms of single-event chains and 
their consequences. The FMEA is generally performed on the basis of lim¬ 
ited design information during the early stages of design and is periodically 
updated to reflect changes in design and improved knowledge of the system. 
The basic questions which must be answered by the analyst in performing an 
FMEA are: 

1. How can each component or subsystem fail? (What is the failure mode?) 

2. What cause might produce this failure? (What is the failure mechanism?) 

3. What are the effects of each failure if it does occur?

Once the FMEA is completed, it assists the analyst in:

1. Selecting, during initial stages, various design alternatives with high reli¬ 
ability and high safety potential 

2. Ensuring that all possible failure modes, and their effects on operational 
success of the system, have been taken into account 

3. Identifying potential failures and the magnitude of their effects on the 
system 

4. Developing testing and checkout methods 

5. Providing a basis for qualitative reliability, availability, and safety anal¬ 
ysis 

6. Providing input data for construction of RBD and FTA models 

7. Providing a basis for establishing corrective measures 

8. Performing an objective evaluation of design requirements related to 
redundancy, failure detection systems, and fail-safe character 

An FMEA for the auto braking example is given in Section B5, Table B3. 

B2.7 Cut-Set and Tie-Set Methods 

A very efficient general method for computing the reliability of any system 
not containing dependent failures can be developed from the properties of the 
reliability graph. The reliability graph consists of a set of branches which rep¬ 
resent the n elements. There must be at least n branches in the graph, but there 
can be more if the same branch must be repeated in more than one path (see 
Fig. B3). The probability of element success is written above each branch. The 
nodes of the graph tie the branches together and form the structure. A path has 
already been defined, but a better definition can be given in terms of graph the¬ 
ory. The term tie set, rather than path, is common in graph nomenclature. A tie 
set is a group of branches which forms a connection between input and output 
when traversed in the arrow direction. We shall primarily be concerned with 
minimal tie sets, which are those containing a minimum number of elements. 
If no node is traversed more than once in tracing out a tie set, the tie set is 
minimal. If a system has i minimal tie sets denoted by $T_1, T_2, \ldots, T_i$, then
the system has a connection between input and output if at least one tie set is
intact. The system reliability is thus given by

$$R = P(T_1 + T_2 + \cdots + T_i) \tag{B19}$$

One can define a cut set of a graph as a set of branches which interrupts
all connections between input and output when removed from the graph. The
minimal cut sets are a group of distinct cut sets containing a minimum number
of terms. All system failures can be represented by the removal of at least one
minimal cut set from the graph. The probability of system failure is, therefore,
given by the probability that at least one minimal cut set fails. If we let $C_1$,
$C_2$, ..., $C_j$ represent the j minimal cut sets and $\bar{C}_j$ the failure of the jth cut set,
the system reliability is given by

$$P_f = P(\bar{C}_1 + \bar{C}_2 + \cdots + \bar{C}_j)$$

$$R = 1 - P_f = 1 - P(\bar{C}_1 + \bar{C}_2 + \cdots + \bar{C}_j) \tag{B20}$$

Figure B4 Reliability graph for a six-element system.

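As an example of the application of cut-set and tie-set analysis we consider
the graph given in Fig. B4. The following combinations of branches are some
of the several tie sets of the system:

$$T_1 = x_1 x_2 \qquad T_2 = x_3 x_4 \qquad T_3 = x_1 x_6 x_4 \qquad T_4 = x_3 x_5 x_2 \qquad T_5 = x_1 x_6 x_5 x_2$$

Tie sets $T_1$, $T_2$, $T_3$, and $T_4$ are minimal tie sets. Tie set $T_5$ is nonminimal since
the top node is encountered twice in traversing the graph. From Eq. (B19)

$$R = P(T_1 + T_2 + T_3 + T_4) = P(x_1 x_2 + x_3 x_4 + x_1 x_6 x_4 + x_3 x_5 x_2) \tag{B21}$$

Similarly we may list some of the several cut sets of the structure, each written
as the joint failure of its branches:

$$\bar{C}_1 = \bar{x}_1 \bar{x}_3 \qquad \bar{C}_2 = \bar{x}_2 \bar{x}_4 \qquad \bar{C}_3 = \bar{x}_1 \bar{x}_5 \bar{x}_3 \qquad \bar{C}_4 = \bar{x}_1 \bar{x}_5 \bar{x}_4$$

$$\bar{C}_5 = \bar{x}_3 \bar{x}_6 \bar{x}_1 \qquad \bar{C}_6 = \bar{x}_3 \bar{x}_6 \bar{x}_2$$

Cut sets $C_1$, $C_2$, $C_4$, and $C_6$ are minimal. Cut sets $C_3$ and $C_5$ are nonminimal
since each of them contains all the branches of cut set $C_1$. Using Eq. (B20),

$$R = 1 - P(\bar{C}_1 + \bar{C}_2 + \bar{C}_4 + \bar{C}_6) = 1 - P(\bar{x}_1 \bar{x}_3 + \bar{x}_2 \bar{x}_4 + \bar{x}_1 \bar{x}_5 \bar{x}_4 + \bar{x}_3 \bar{x}_6 \bar{x}_2) \tag{B22}$$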
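
Equations (B21) and (B22) can be compared numerically. The sketch below does not expand the unions by inclusion-exclusion; instead it enumerates all 2^6 element states (a brute-force substitute that is practical only for small graphs) and assumes independent element failures. The function names and the element reliabilities are illustrative assumptions; the tie sets and cut sets are the minimal ones listed above.

```python
from itertools import product
from math import prod

# Minimal tie sets and minimal cut sets of the six-element graph, by element index.
TIE_SETS = [{1, 2}, {3, 4}, {1, 6, 4}, {3, 5, 2}]
CUT_SETS = [{1, 3}, {2, 4}, {1, 5, 4}, {3, 6, 2}]

def _states(p):
    """Yield (set of good elements, set of failed elements, state probability)."""
    elements = sorted(p)
    for flags in product([True, False], repeat=len(elements)):
        good = {e for e, ok in zip(elements, flags) if ok}
        bad = set(elements) - good
        prob = prod(p[e] if e in good else 1.0 - p[e] for e in elements)
        yield good, bad, prob

def reliability_from_tie_sets(p):
    """Eq. (B21): probability that at least one minimal tie set is fully intact."""
    return sum(prob for good, _, prob in _states(p) if any(t <= good for t in TIE_SETS))

def reliability_from_cut_sets(p):
    """Eq. (B22): 1 minus the probability that at least one minimal cut set has failed."""
    p_fail = sum(prob for _, bad, prob in _states(p) if any(c <= bad for c in CUT_SETS))
    return 1.0 - p_fail

p = {1: 0.9, 2: 0.9, 3: 0.9, 4: 0.9, 5: 0.9, 6: 0.9}   # assumed element reliabilities
print(reliability_from_tie_sets(p))
print(reliability_from_cut_sets(p))
```

The two results should coincide up to rounding, because every system state either leaves some minimal tie set intact or removes some minimal cut set.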

In a large problem there will be many cut sets and tie sets, and although Eqs.
(B19) and (B20) are easily formulated, the expansion of either equation is a
formidable task. (If there are n events in a union, the expansion of the probability
of the union involves $2^n - 1$ terms.) Several approximations which are useful
in simplifying the computations are discussed in Messinger and Shooman
[1967] and in Shooman [1990, p. 138].


B3 FAILURE-RATE MODELS
B3.1 Introduction

The previous section has shown how one constructs various combinatorial reli¬ 
ability models which express system reliability in terms of element reliabil¬ 
ity. This section introduces several different failure models for the system ele¬ 
ments. These element failure models are related to life-test results and failure- 
rate data via probability theory. 

The first step in constructing a failure model is to locate test data or plan a 
test on parts substantially the same as those to be used. From these data the part 
failure rate is computed and graphed. On the basis of the graph, any physical 
failure information, engineering judgment, and sometimes statistical tests, a 
failure-rate model is chosen. The parameters of the model are estimated from 
the graph or computed using the statistical principles of estimation, which are 
developed in Section A9. This section discusses the treatment of the data and 
the choice of a model. 

The emphasis is on simple models, which are easy to work with and con¬ 
tain one or two parameters. This simplifies the problems of interpretation and 
parameter determination. Also in most cases the data are not abundant enough 
and the test conditions are not sufficiently descriptive of the proposed usage 
to warrant more complex models. 

B3.2 Treatment of Failure Data 

Part failure data are generally obtained from either of two sources: the failure
times of various items in a population placed on a life test, or repair reports listing
operating hours of replaced parts in equipment already in field use. Experience
has shown that a very good way to present these data is to compute and
plot either the failure density function or the hazard rate as a function of time.

The data we are dealing with are a sequence of times to failure, but the failure 
density function and the hazard rate are continuous variables. We first compute 
a piecewise-continuous failure density function and hazard rate from the data. 

We begin by defining piecewise-continuous failure density and hazard-rate
functions in terms of the data. It can be shown that these discrete functions
approach the continuous functions in the limit as the number of data becomes
large and the interval between failure times approaches zero. Assume that our
data describe a set of N items placed in operation at time t = 0. As time progresses,
items fail, and at any time t the number of survivors is n(t). The data density
function (also called empirical density function) defined over the time interval
$t_i < t < t_i + \Delta t_i$ is given by the ratio of the number of failures occurring in the
interval to the size of the original population, divided by the length of the time
interval:¹

$$f_d(t) = \frac{[n(t_i) - n(t_i + \Delta t_i)]/N}{\Delta t_i} \qquad \text{for } t_i < t < t_i + \Delta t_i \tag{B23}$$


1 In general a sequence of time intervals $t_0 < t < t_0 + \Delta t_0$, $t_1 < t < t_1 + \Delta t_1$, etc., is defined, where
$t_1 = t_0 + \Delta t_0$, $t_2 = t_1 + \Delta t_1$, etc.


TABLE B1 Failure Data for 10 Hypothetical Electronic Components

Failure Number    Operating Time, h

 1                  8
 2                 20
 3                 34
 4                 46
 5                 63
 6                 86
 7                111
 8                141
 9                186
10                266


Similarly, the data hazard rate² over the interval $t_i < t < t_i + \Delta t_i$ is defined as
the ratio of the number of failures occurring in the time interval to the number
of survivors at the beginning of the time interval, divided by the length of the
time interval.

$$z_d(t) = \frac{[n(t_i) - n(t_i + \Delta t_i)]/n(t_i)}{\Delta t_i} \qquad \text{for } t_i < t < t_i + \Delta t_i \tag{B24}$$


The failure density function $f_d(t)$ is a measure of the overall speed at which
failures are occurring, whereas the hazard rate $z_d(t)$ is a measure of the
instantaneous speed of failure. Since the numerators of both Eqs. (B23) and (B24)
are dimensionless, both $f_d(t)$ and $z_d(t)$ have the dimensions of inverse time
(generally the time unit is hours).

The failure data for a life test run on a group of 10 hypothetical electronic
components are given in Table B1. The computation of $f_d(t)$ and $z_d(t)$ from
the data appears in Table B2.

The time intervals $\Delta t_i$ were chosen as the times between failure, and the first
time interval $t_0$ started at the origin; that is, $t_0 = 0$. The remaining time intervals
$t_i$ coincided with the failure times. In each case the failure was assumed to
have occurred just before the end of the interval. Two alternate procedures
are possible. The failure could have been assumed to occur just after the time
interval closed, or the beginning of each interval $t_i$ could have been defined as
the midpoint between failures. In this book we shall consistently use the first
method, which is illustrated in Table B2.


2 Hazard rate is sometimes called hazard or failure rate.


TABLE B2 Computation of Data Failure Density and Data Hazard Rate

Time Interval,   Failure Density per Hour,    Hazard Rate per Hour,
h                f_d(t) (× 10^-2)             z_d(t) (× 10^-2)

0-8              1/(10 × 8)  = 1.25           1/(10 × 8) = 1.25
8-20             1/(10 × 12) = 0.84           1/(9 × 12) = 0.93
20-34            1/(10 × 14) = 0.72           1/(8 × 14) = 0.89
34-46            1/(10 × 12) = 0.84           1/(7 × 12) = 1.19
46-63            1/(10 × 17) = 0.59           1/(6 × 17) = 0.98
63-86            1/(10 × 23) = 0.44           1/(5 × 23) = 0.87
86-111           1/(10 × 25) = 0.40           1/(4 × 25) = 1.00
111-141          1/(10 × 30) = 0.33           1/(3 × 30) = 1.11
141-186          1/(10 × 45) = 0.22           1/(2 × 45) = 1.11
186-266          1/(10 × 80) = 0.13           1/(1 × 80) = 1.25
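
The entries of Table B2 follow mechanically from Eqs. (B23) and (B24). The sketch below recomputes them from the failure times of Table B1, assuming (as in the text) that the intervals are the times between failures and that exactly one item fails at the end of each interval; the function name is an illustrative assumption.

```python
failure_times = [8, 20, 34, 46, 63, 86, 111, 141, 186, 266]   # hours, from Table B1

def data_density_and_hazard(times):
    """Return (t_start, t_end, f_d, z_d) per interval, one failure per interval."""
    n_total = len(times)                   # N, size of the original population
    survivors = n_total                    # n(t_i) at the start of the first interval
    t_start = 0.0
    rows = []
    for t_end in times:
        dt = t_end - t_start
        f_d = (1.0 / n_total) / dt         # Eq. (B23): failures / N / interval length
        z_d = (1.0 / survivors) / dt       # Eq. (B24): failures / n(t_i) / interval length
        rows.append((t_start, t_end, f_d, z_d))
        survivors -= 1
        t_start = t_end
    return rows

table_b2 = data_density_and_hazard(failure_times)
for t0, t1, f_d, z_d in table_b2:
    print(f"{t0:5.0f}-{t1:<5.0f}  f_d = {100 * f_d:.2f}e-2 /h   z_d = {100 * z_d:.2f}e-2 /h")
```

The printed values are rounded to two decimals by the format specifier, so the last digit may differ slightly from a table prepared with a different rounding convention.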


Since $f_d(t)$ is a density function, we can define a data failure distribution
function and a data success distribution function by

$$F_d(t) = \int_0^t f_d(\xi)\,d\xi \tag{B25a}$$

$$R_d(t) = 1 - F_d(t) = 1 - \int_0^t f_d(\xi)\,d\xi \tag{B25b}$$

where $\xi$ is just a dummy variable of integration. Since the $f_d(\xi)$ curve is a
piecewise-continuous function consisting of a sum of step functions, its integral
is a piecewise-continuous function made up of a sum of ramp functions.

The functions $F_d(t)$ and $R_d(t)$ are computed for the preceding example by
the appropriate integration of Fig. B5(a) and are given in Fig. B5(c) and (d).
By inspection of Eqs. (B23) and (B25b) we see that

$$R_d(t_i) = \frac{n(t_i)}{N} \tag{B26}$$

Figure B5 Density and hazard functions for the data of Table B1. (a) Data failure
density functions; (b) data hazard rate; (c) data failure distribution function; (d) data
success function.

In the example given in Table B1, only 10 items were on test, and the computations
were easily made. If many items are tested, the computation intervals
$\Delta t_i$ cannot be chosen as the times between failures since the computations
become too lengthy. The solution is to divide the time scale into several
equally spaced intervals. Statisticians call these class intervals, and the midpoint
of the interval is called a class mark. Graphical diagrams such as Fig.
B5(a) and (b) are called histograms.
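
The class-interval approach can be sketched as follows. The code applies Eqs. (B23), (B24), and (B26) over equally spaced intervals, reusing the ten failure times of Table B1 purely as sample input; the interval width of 50 h and the function name are assumptions made for illustration, and a failure falling exactly on an interval boundary is assigned to the interval that ends there.

```python
failure_times = [8, 20, 34, 46, 63, 86, 111, 141, 186, 266]   # hours, from Table B1

def class_interval_estimates(times, width):
    """Return (t_start, t_end, R_d, f_d, z_d) for equally spaced class intervals."""
    n_total = len(times)
    rows = []
    t_start = 0.0
    while True:
        survivors = sum(1 for t in times if t > t_start)       # n(t_i)
        if survivors == 0:
            break                                              # every item has failed
        t_end = t_start + width
        failures = sum(1 for t in times if t_start < t <= t_end)
        r_d = survivors / n_total                              # Eq. (B26)
        f_d = failures / n_total / width                       # Eq. (B23)
        z_d = failures / survivors / width                     # Eq. (B24)
        rows.append((t_start, t_end, r_d, f_d, z_d))
        t_start = t_end
    return rows

rows = class_interval_estimates(failure_times, width=50.0)
for t0, t1, r_d, f_d, z_d in rows:
    mark = 0.5 * (t0 + t1)                                     # class mark (interval midpoint)
    print(f"class mark {mark:6.1f} h  R_d = {r_d:.2f}  f_d = {f_d:.4f}  z_d = {z_d:.4f}")
```

Wider class intervals smooth the histogram at the cost of resolution, so the width is a judgment call that depends on how many items were tested.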

B3.3 Failure Modes and Handbook Failure Data 

After plotting and examining failure data for several years, people began to
recognize several modes of failure. Early in the lifetime of equipment or a part,
there are a number of failures due to initial weakness or defects: poor insulation,
weak parts, bad assembly, poor fits, etc. During the middle period of
equipment operation fewer failures take place, and it is difficult to determine
their cause. In general they seem to occur when the environmental stresses
exceed the design strengths of the part or equipment. It is difficult to predict
the environmental-stress amplitudes or the part strengths as deterministic functions
of time; thus the middle-life failures are often called random failures.³ As
the item reaches old age, things begin to deteriorate, and many failures occur.
This failure region is quite naturally called the wear-out region. Typical f(t)
and z(t) curves⁴ illustrating these three modes of behavior are shown in Fig.
B6. The early failures, also called initial failures or infant mortality,⁵ appear
as decreasing z(t) and f(t) functions. The random-failure, or constant-hazard-rate,
mode is characterized by an approximately constant z(t) and a companion
f(t) which is approximately exponential. In the wear-out or rising-failure-rate
region, the z(t) function increases whereas f(t) has a humped appearance.

It is clear that it is easier to distinguish the various failure modes by inspection
of the z(t) curve than it is from the appearance of the f(t) function. This
is one of the major reasons why hazard rate is introduced. Because of the
monotonic nature of F(t) and R(t) these functions are even less useful in
distinguishing failure modes.

The curve of Fig. B6(b) has been discussed by many of the early writers on the
subject of reliability [Carhart, 1953] and is often called the bathtub curve because
of its shape. The fact that such a hazard curve occurs for many types of equipment
has been verified by experience. Also when failed components have been dismantled
to determine the reasons for failure, the conclusions have again justified
the hypothesis of three failure modes. In fact most manufacturers of high-reliability
components now subject their products to an initial burn-in period of $t_1$ hours
to eliminate the initial failure region shown in Fig. B6. At the onset of wearout
at time $t_2$, the hazard rate begins to increase rapidly, and it is wise to replace the
item after $t_2$ hours of operation. Thus, if the bathtub curve were really a universal
model, one would pretest components for $t_1$ hours, place the survivors in use
for an additional $t_2 - t_1$ hours, and then replace them with fresh pretested components.
This would reduce the effective hazard rate and improve the probability
of survival if burn-in and replacement are feasible. Unfortunately, many types of
equipment have a continuously decreasing or continuously increasing hazard and
therefore behave differently. It often happens that electronic components have a
constant hazard and mechanical components a wear-out characteristic. Unfortunately,
even though reliability theory is 4 to 5 decades old, not enough comparative
analysis has been performed on different types of hazard models and failure
data to make a definitive statement as to which models are best for all types of
components.

Figure B6 General form of failure curves. (a) Failure density; (b) hazard rate.


3 Actually all the failures are random; thus a term such as unclassifiable as to cause would be
more correct.

4 We are now referring to continuous hazard and failure density functions, which represent the
limiting forms of $f_d(t)$ and $z_d(t)$ as discussed.

5 Some of the terms, as well as the concept of hazard, have been borrowed from those used by
actuaries, who deal with life insurance statistics.

Many failure data on parts and components have been recorded since the
beginning of formal interest in reliability in the early 1950s. Large industrial
organizations such as Radio Corporation of America, General Electric Company,
Motorola, etc., published handbooks of part-failure-rate data compiled
from life-test and field-failure data. These data and other information were
compiled into an evolving series of part-failure-rate handbooks: MIL-HDBK-217,
217A, 217B, 217C, 217D, 217E, and 217F (Government Printing Office, Washington,
DC). Another voluminous failure data handbook is Failure Rate Data Handbook
(FARADA), published by the GIDEP program (Government Industrial Data
Exchange Program), Naval Fleet Missile Systems, Corona, CA. The FARADA
handbook includes such vital information as the number on test, the number of
failures, and some details on the source of the data and the environment. This
information allows one to use engineering judgment in selecting failure rates
from this reference. Many practitioners now use failure databases compiled by
various telecommunications companies in the United States and worldwide (see
Shooman [1990, Appendix K], and Section D2.1 of this book).

In practice, failure rates for various components are determined from
handbook data, field failure-rate data, or life test data provided by component
manufacturers. A large fraction of the components used in modern systems are
microelectronic circuits. Many are analog in nature; however, even more are
digital integrated circuits (ICs). One reason that modern electronic equipment is
very reliable is that these ICs have a very low failure rate. Furthermore, the
failure rate of ICs increases only slowly as their complexity increases.
Analysis of past failure-rate data allows one to develop a simple model for the
failure rate of digital integrated circuits, which is very useful for initial
reliability comparisons of various designs.

A curve showing the failure rate per gate versus gate complexity for digital
ICs is given in Fig. B7, which was adapted from Siewiorek [1982].
The light solid and light dotted lines in the figure, as well as the dark and
light circles, represent data plotted from various sources. The heavy solid lines
were fitted to the data by the author. Note that the slopes are all approximately
parallel and that the reliability improved from 1965 to 1975 to 1985.

The heavy lines are based on the assumption that the failure rate increases
as the square root of the number of gates:

$$\lambda_b = C\, g^{1/2} \tag{B27a}$$

where

$\lambda_b$ = the base failure rate
$C$ = a constant
$g$ = the number of gates in the equivalent circuit for the chip

Others use a different failure-rate model, where $\lambda_b \sim g^{\alpha}$ and $0.1 < \alpha < 0.3$
[Healey, 2001]. If we express the failure rate per gate, $\lambda_b'$, we obtain from
Eq. (B27a):

$$\lambda_b' = \lambda_b / g = C / g^{1/2} \tag{B27b}$$

Figure B7 Failure rate per gate as a function of chip complexity for bipolar
technology. (Adapted from Siewiorek [1982, p. 10, Fig. 1-5].)


The values of C, determined by fitting the heavy curve to the data for 1965, 
1975, and 1985, are C = 0.32, 0.04, and 0.004. This indicates that IC reliability 
has improved by about a factor of 10 for each decade. More details of this 
model are given in Shooman [1990, pp. 641-644]. 
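
As a rough illustration of how Eqs. (B27a) and (B27b) might be used for an initial comparison, the following sketch evaluates the square-root model with the fitted constants quoted above. The function names and the 10,000-gate example are assumptions made here for illustration, and the units of the result are those of the fitted data in Fig. B7, which are not restated in this section.

```python
import math

# Fitted constants C quoted in the text for each decade.
C_BY_YEAR = {1965: 0.32, 1975: 0.04, 1985: 0.004}


def base_failure_rate(gates, year):
    """Eq. (B27a): lambda_b = C * sqrt(g)."""
    return C_BY_YEAR[year] * math.sqrt(gates)


def failure_rate_per_gate(gates, year):
    """Eq. (B27b): lambda_b' = lambda_b / g = C / sqrt(g)."""
    return base_failure_rate(gates, year) / gates


# Hypothetical chip size, chosen only to make the arithmetic easy to follow.
gates = 10_000
for year in sorted(C_BY_YEAR):
    print(year, base_failure_rate(gates, year), failure_rate_per_gate(gates, year))
```

With the same gate count, each successive decade lowers both values by roughly a factor of 10, which is the improvement noted in the text.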

We now propose a hypothetical explanation for why the failure rate should be
proportional to the square root of the number of gates if we assume that most of
the IC failures occur from electromigration, a process that produces projections
of conducting material growing out of the various areas carrying current on an
IC chip. These projections grow as a function of current and time. If a projection
touches another projection or current-carrying area on the chip, then shorting
may occur, resulting in a failure of the IC. (Another possible result of
electromigration is the forming of voids, creating open circuits or unacceptable
changes in line/contact resistances. Note that newer technologies now in use are
less prone to electromigration failure modes [Bernstein, 2001].)


We can model the time to failure in the following manner: Let $s$ (mm) be
the average spacing between current-carrying elements on the chip and let $v$
(mm/hr) be the average speed at which the projections elongate. The average
time to failure, $t_f$ (hr), is then proportional to $s/v$. In general, $t_f$, $s$, and $v$ will
be random variables; however, we can characterize $t_f$ by its expected value,
which is generally called the mean time to failure (MTTF), yielding

$$\text{MTTF} = K_1 s / v \tag{B27c}$$

where $K_1$ is a proportionality constant computed by taking the expected value
of the distribution for the ratio of the random variables $s$ and $v$.

For a fixed chip area, $A$, we assume that the horizontal spacing $s_h$ is inversely
proportional to the number of gates across the width of the chip. Similarly,
the vertical spacing, $s_v$, is inversely proportional to the number of gates along
the length of the chip. Assuming a square chip and $g$ total gates, the number
of gates across the width (and along the length) will be equal to $\sqrt{g}$ and
$s_h = s_v = s$, which is given by

$$s = \frac{K_2}{\sqrt{g}} \tag{B27d}$$

where $K_2$ is a proportionality constant, which is a function of the fabrication
techniques used for the IC.

A substitution of Eq. (B27d) into Eq. (B27c) yields

$$\text{MTTF} = \frac{K_1 K_2}{v\sqrt{g}} \tag{B27e}$$

The simplest failure-rate model is the constant failure-rate model, where $\lambda$ is
a constant with units failures/hr. For this model, it can be shown that
$\lambda = 1/\text{MTTF}$. Thus

$$\lambda = \frac{v\sqrt{g}}{K_1 K_2} \tag{B27f}$$

This development therefore leads to a failure rate that is proportional to the
square root of the number of gates on the chip. The author emphasizes that
this is a hypothesis, not a proven explanation.

Another hypothesis for newer chip technologies relates initial yield and
failure rate to residual chip defects. Thus $\lambda \sim \text{Area} \sim g^{1}$ [Bernstein, 2001].


B3.4 Reliability in Terms of Hazard Rate and Failure Density 

In the previous section, various functions associated with failure data were
defined and computed for the data given in the examples. These functions
were $z_d(t)$, $f_d(t)$, $F_d(t)$, and $R_d(t)$. In this section we begin by defining two
random variables and deriving in a careful manner the basic definitions and
relations between the theoretical hazard, failure density function, failure
distribution function, and reliability function.

The random variable $\mathbf{t}$ is defined as the failure time of the item in question. 6
Thus, the probability of failure as a function of time is given as

$$P(\mathbf{t} \le t) = F(t) \tag{B28}$$

which is simply the definition of the failure distribution function. We can define
the reliability, which is a probability of success in terms of F(t), as

$$R(t) = P_s(t) = 1 - F(t) = P(\mathbf{t} > t) \tag{B29}$$

The failure density function is of course given by

$$\frac{dF(t)}{dt} = f(t) \tag{B30}$$


We now consider a population of N items with the same failure-time
distribution. The items fail independently with probability of failure given by
F(t) = 1 - R(t) and probability of success given by R(t). If the random
variable N(t) represents the number of units surviving at time t, then N(t) has a
binomial distribution with p = R(t). Therefore,

$$P[N(t) = n] = B[n; N, R(t)] = \frac{N!}{n!\,(N-n)!}\,[R(t)]^{n}\,[1 - R(t)]^{N-n}, \qquad n = 0, 1, \ldots, N \tag{B31}$$

The number of units operating at any time t is a random variable and is not
fixed; however, we can compute the expected value n(t). From Table A1 we
see that the expected value of a random variable with a binomial distribution
is given by NR(t) and leads to

$$n(t) \equiv E[N(t)] = N R(t) \tag{B32}$$

Solving for the reliability yields

$$R(t) = \frac{n(t)}{N} \tag{B33}$$

Thus, the reliability at time t is the average fraction of surviving units at time t.
This verifies Eq. (B27), which was obtained as a consequence of the definition
of $f_d(t)$. From Eq. (B29) we obtain


$$F(t) = 1 - \frac{n(t)}{N} = \frac{N - n(t)}{N} \tag{B34}$$

and from Eq. (B30)

$$f(t) = \frac{dF(t)}{dt} = -\frac{1}{N}\frac{dn(t)}{dt}$$

$$f(t) \equiv \lim_{\Delta t \to 0} \frac{n(t) - n(t + \Delta t)}{N\,\Delta t} \tag{B35}$$

6 In some problems, a more appropriate random variable is the number of miles, cycles, etc. The
results are analogous.


Thus, we see that Eq. (B23) is valid, and as N becomes large and $\Delta t_i$ becomes
small, Eq. (B23) approaches Eq. (B35) in the limit. From Eq. (B34) we see
that F(t) is the average fraction of units having failed between 0 and time t,
and Eq. (B35) states that f(t) is the rate of change of F(t), or its slope. From
Eq. (B35) we see that the failure density function f(t) is normalized in terms
of the size of the original population N. In many cases it is more informative
to normalize with respect to n(t), the number of survivors. Thus, we define the
hazard rate as

$$z(t) \equiv \lim_{\Delta t \to 0} \frac{n(t) - n(t + \Delta t)}{n(t)\,\Delta t} \tag{B36}$$
The definition of z(t) in Eq. (B36) of course agrees with the definition of $z_d(t)$
in Eq. (B24). We can relate z(t) and f(t) using Eqs. (B35) and (B36):

$$z(t) = \lim_{\Delta t \to 0} \frac{n(t) - n(t + \Delta t)}{\Delta t}\,\frac{1}{n(t)} = N f(t)\,\frac{1}{n(t)}$$

Substitution of Eq. (B33) yields

$$z(t) = \frac{f(t)}{R(t)} \tag{B37}$$

We now wish to relate R(t) to f(t) and to z(t). From Eqs. (B29) and (B30) we
see that

$$R(t) = 1 - F(t) = 1 - \int_0^t f(\xi)\,d\xi \tag{B38}$$

where $\xi$ is merely a dummy variable. Substituting into Eq. (B37) from Eqs.
(B35) and (B33), we obtain

$$z(t) = -\frac{1}{N}\frac{dn(t)}{dt}\,\frac{N}{n(t)} = -\frac{d}{dt}\ln n(t)$$


Solving the differential equation yields:

$$\ln n(t) = -\int_0^t z(\xi)\,d\xi + c$$

where $\xi$ is a dummy variable and c is the constant of integration. Taking the
antilog of both sides of the equation gives:

$$n(t) = e^{c} \exp\left[-\int_0^t z(\xi)\,d\xi\right]$$

Inserting initial conditions

$$n(0) = N = e^{c}$$

gives

$$n(t) = N \exp\left[-\int_0^t z(\xi)\,d\xi\right]$$

Substitution of Eq. (B33) completes the derivation

$$R(t) = \exp\left[-\int_0^t z(\xi)\,d\xi\right] \tag{B39}$$

Equations (B35) and (B36) serve to define the failure density function and the
hazard rate, and Eqs. (B37) to (B39) relate R(t) to f(t) and z(t). 7
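
The relations just derived can be checked numerically. The sketch below is one possible way to evaluate Eq. (B39) for an arbitrary hazard function by trapezoidal integration and then to obtain f(t) from Eq. (B37); the function names, step count, and the two example hazards are assumptions chosen for illustration only.

```python
import math


def reliability_from_hazard(z, t, steps=10_000):
    """Eq. (B39): R(t) = exp(-integral from 0 to t of z), trapezoidal rule."""
    if t == 0:
        return 1.0
    h = t / steps
    area = 0.5 * (z(0.0) + z(t))
    for k in range(1, steps):
        area += z(k * h)
    return math.exp(-area * h)


def density_from_hazard(z, t, steps=10_000):
    """Eq. (B37) rearranged: f(t) = z(t) * R(t)."""
    return z(t) * reliability_from_hazard(z, t, steps)


# Hypothetical hazards, used only as test inputs.
def constant_hazard(t):
    return 0.01            # failures per hour


def linear_hazard(t):
    return 2e-4 * t        # hazard growing linearly with time


# Both integrals equal 1 at t = 100, so both results should be close to exp(-1).
print(reliability_from_hazard(constant_hazard, 100.0))
print(reliability_from_hazard(linear_hazard, 100.0))
print(density_from_hazard(constant_hazard, 100.0))
```

The trapezoidal rule is exact for constant and linear hazards, so these two cases only verify the bookkeeping; a more strongly curved z(t) would need a step count chosen with some care.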


B3.5 Hazard Models 

On first consideration it might appear that if failure data and graphs such as
Fig. B5(a-d) are available, there is no need for a mathematical model. However,
in drawing conclusions from test data on the behavior of other, similar
components it is necessary to fit the failure data with a mathematical model.
The discussion will start with several simple models and gradually progress to
the more involved problem of how to choose a general model which fits all
cases through adjustment of constants.


7 An alternative derivation of these expressions is given in Shooman [1990, p. 183].

Figure B8 Constant-hazard model: (a) constant hazard; (b) decaying exponential
density function; (c) rising exponential distribution function; and (d) decaying
exponential reliability function.


Constant Hazard. For a good many years, reliability analysis was almost
wholly concerned with constant-hazard rates. Indeed many data have been
accumulated, like those in Fig. B5(b), which indicate that a constant-hazard
model is appropriate in many cases.

If a constant-hazard rate $z(t) = \lambda$ is assumed, the time integral is given by
$\int_0^t \lambda\,d\xi = \lambda t$. Substitution in Eqs. (B37) to (B39) yields

$$z(t) = \lambda \tag{B40}$$

$$f(t) = \lambda e^{-\lambda t} \tag{B41}$$

$$R(t) = e^{-\lambda t} = 1 - F(t) \tag{B42}$$

The four functions z(t), f(t), F(t), and R(t) are sketched in Fig. B8. A
constant-hazard rate implies an exponential density function and an exponential
reliability function.

The constant-hazard model forbids any deterioration in time of strength or
soundness of the items in the population. Thus, if $\lambda = 0.1$ per hour, we can
expect about 10 failures in a population of 100 items during the first hour of
operation (more precisely, $100(1 - e^{-0.1}) \approx 9.5$) and the same number of
failures between the thousandth and thousand and first hours of operation in a
population of 100 items that have already survived 1,000 hours. A simple hazard
model that admits deterioration in time, i.e., wear, is one in which the failure
rate increases with time.
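
The "no deterioration" property can be made concrete with a short calculation. The sketch below, with a function name chosen here for illustration, computes the expected number of failures in a one-hour window for 100 units that are known to be alive at the start of the window, using only the exponential reliability function of the constant-hazard model.

```python
import math


def expected_failures(lam, survivors, start, duration):
    """Expected failures among `survivors` units that are alive at time `start`
    during the following `duration` hours, for a constant hazard `lam`."""
    r_start = math.exp(-lam * start)
    r_end = math.exp(-lam * (start + duration))
    # Conditional probability of failing in the window, given survival to `start`.
    conditional_failure_prob = 1.0 - r_end / r_start
    return survivors * conditional_failure_prob


first_hour = expected_failures(0.1, 100, 0.0, 1.0)
after_1000 = expected_failures(0.1, 100, 1000.0, 1.0)
print(first_hour, after_1000)
```

Both windows give the same expectation, a little under 10 failures, because the conditional failure probability over any one-hour window depends only on the hazard and the window length, not on the age of the units.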









Figure B9 Linearly increasing hazard: (a) linearly increasing hazard; (b) Rayleigh 
density function; (c) Rayleigh distribution function; (d) Rayleigh reliability function. 


Sometimes a test is conducted for N parts for T hours and no parts fail. The
total number of test hours is H = NT, but the number of failures is zero. Thus
one is tempted to estimate $\lambda$ by 0/H, which is incorrect. A better procedure is
to say that the failure rate is less than it would be if one failure had occurred,
$\lambda < 1/H$. More advanced statistical techniques suggest that $\lambda = (1/3)/H$
[Welker, 1974].
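
A minimal sketch of the zero-failure arithmetic follows. The function name and the test of 50 parts for 1,000 hours are hypothetical; the two returned values are simply the bound 1/H and the (1/3)/H figure quoted above.

```python
def zero_failure_rate_estimates(n_parts, hours_each):
    """Failure-rate figures for a test of n_parts for hours_each hours with no failures."""
    total_hours = n_parts * hours_each            # H = N * T
    return {
        "upper_bound": 1.0 / total_hours,         # lambda < 1/H
        "welker": (1.0 / 3.0) / total_hours,      # lambda = (1/3)/H
    }


# Hypothetical test: 50 parts, 1,000 hours each, no failures observed.
estimates = zero_failure_rate_estimates(50, 1000)
print(estimates)
```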

Linearly Increasing Hazard. When wear or deterioration is present, the
hazard will increase as time passes. The simplest increasing-hazard model that can
be postulated is one in which the hazard increases linearly with time. Assuming
that z(t) = Kt for t > 0 yields

$$z(t) = Kt \tag{B43}$$

$$f(t) = Kt\,e^{-Kt^{2}/2} \tag{B44}$$

$$R(t) = e^{-Kt^{2}/2} \tag{B45}$$

These functions are sketched in Fig. B9. The density function of Eq. (B44) is
a Rayleigh density function.

The Weibull Model. In many cases, the z(t) curve cannot be approximated by
a straight line, and the previously discussed models fail. In order to fit various
z(t) curvatures, it is useful to investigate a hazard model of the form

$$z(t) = K t^{m} \quad \text{for } m > -1 \tag{B46}$$

This form of model was discussed in detail in a paper by Weibull [1951] and is
generally called a Weibull model. The associated density and reliability
functions are

$$f(t) = K t^{m}\, e^{-K t^{m+1}/(m+1)} \tag{B47}$$

$$R(t) = e^{-K t^{m+1}/(m+1)} \tag{B48}$$

By appropriate choice of the two parameters K and m, a wide range of
hazard curves can be approximated. The various functions obtained for typical
values of m are shown in Fig. B10. For fixed values of m, a change in
the parameter K merely changes the vertical amplitude of the z(t) curve; thus,
z(t)/K is plotted versus time. Changing K produces a time-scale effect on the
R(t) function; therefore, time is normalized so that $\tau^{m+1} = [K/(m+1)]\,t^{m+1}$.
The amplitude of the hazard curve affects the time scale of the reliability
function; consequently, the parameter K is often called the scale parameter. The
parameter m obviously affects the shape of all the reliability functions shown
and is consequently called the shape parameter. The curves m = 0 and m = 1
are constant-hazard and linearly-increasing-hazard models, respectively. It is
clear from inspection of Fig. B10 that a wide variety of models is possible
by appropriate selection of K and m. The drawback is, of course, that this
is a two-parameter model, which means a greater difficulty in sketching the
results and increased difficulty in estimating the parameters. A three-parameter
Weibull model can be formulated by replacing $t$ by $t - t_0$, where $t_0$ is called
the location parameter.
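
The following sketch evaluates Eqs. (B46) to (B48) directly and confirms that m = 0 and m = 1 reproduce the constant-hazard and linearly increasing hazard models. The function names and numerical values are illustrative assumptions; for negative m the functions should only be called with t > 0.

```python
import math


def weibull_hazard(t, K, m):
    """Eq. (B46): z(t) = K * t**m, for m > -1."""
    return K * t ** m


def weibull_reliability(t, K, m):
    """Eq. (B48): R(t) = exp(-K * t**(m+1) / (m+1))."""
    return math.exp(-K * t ** (m + 1) / (m + 1))


def weibull_density(t, K, m):
    """Eq. (B47): f(t) = z(t) * R(t)."""
    return weibull_hazard(t, K, m) * weibull_reliability(t, K, m)


# m = 0 reduces to the constant-hazard model, m = 1 to the Rayleigh case.
print(weibull_reliability(10.0, 0.05, 0), math.exp(-0.05 * 10.0))
print(weibull_reliability(10.0, 0.02, 1), math.exp(-0.02 * 10.0 ** 2 / 2))
# A decreasing hazard (m between -1 and 0) can describe early failures.
print(weibull_hazard(1.0, 0.1, -0.5), weibull_hazard(4.0, 0.1, -0.5))
```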

B3.6 Mean Time To Failure 

It is often convenient to characterize a failure model or set of failure data by
a single parameter. One generally uses the mean time to failure or the mean
time between failures for this purpose. If we have life-test information on a
population of n items with failure times $t_1, t_2, \ldots, t_n$, then the MTTF 8 is defined
by the following equation [see also Eq. (A34)]:

$$\text{MTTF} = \frac{1}{n}\sum_{i=1}^{n} t_i \tag{B49}$$


8 Sometimes the term mean time between failures (MTBF) is used interchangeably with the term
MTTF; however, strictly speaking, the MTBF has meaning only when one is discussing a renewal
situation, where there is repair or replacement. See Shooman [1990, Section 6.10].

If one is discussing a hazard model, the MTTF for the probability distribution
defined by the model is given by Eq. (A33) as

$$\text{MTTF} = E(\mathbf{t}) = \int_0^{\infty} t\,f(t)\,dt \tag{B50}$$

In a single-parameter distribution, specification of the MTTF fixes the 
parameter. In a multiple-parameter distribution, fixing the MTTF only places 
one constraint on the model parameters. 

One can express Eq. (B50) by a simpler computational expression involving
the reliability function: 9

$$\text{MTTF} = \int_0^{\infty} R(t)\,dt \tag{B51}$$

As an example of the use of Eq. (B51) the MTTF for several different hazards
will be computed. For a single component with a constant hazard:

$$\text{MTTF} = \int_0^{\infty} e^{-\lambda t}\,dt = \left.\frac{e^{-\lambda t}}{-\lambda}\right|_0^{\infty} = \frac{1}{\lambda} \tag{B52}$$

For a linearly increasing hazard:

$$\text{MTTF} = \int_0^{\infty} e^{-Kt^{2}/2}\,dt = \frac{\Gamma(1/2)}{2\sqrt{K/2}} = \sqrt{\frac{\pi}{2K}} \tag{B53}$$

For a Weibull distribution:

$$\text{MTTF} = \int_0^{\infty} e^{-Kt^{m+1}/(m+1)}\,dt = \frac{\Gamma[(m+2)/(m+1)]}{[K/(m+1)]^{1/(m+1)}} \tag{B54}$$

In Eq. (B52) the MTTF is simply the reciprocal of the hazard, whereas in Eq.
(B53) it varies as the reciprocal of the square root of the hazard slope. In Eq.
(B54) the relationship between MTTF, K, and m is more complex (see Table
A1).

In many cases we assume an exponential density (constant hazard) for
simplicity, and for this case we frequently hear the statement “the MTBF is the
reciprocal of the failure rate.” The reader should not forget the assumptions
necessary for this statement to hold.
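
The closed forms of Eqs. (B52) to (B54) can be cross-checked against a direct numerical evaluation of Eq. (B51). The sketch below is one way to do so; the function names, the finite upper limit, and the parameter values are assumptions made for the illustration, and the upper limit must be large enough that R(t) is negligible beyond it.

```python
import math


def mttf_constant(lam):
    """Eq. (B52): MTTF = 1 / lambda."""
    return 1.0 / lam


def mttf_linear(K):
    """Eq. (B53): MTTF = sqrt(pi / (2K))."""
    return math.sqrt(math.pi / (2.0 * K))


def mttf_weibull(K, m):
    """Eq. (B54): Gamma((m+2)/(m+1)) / (K/(m+1))**(1/(m+1))."""
    return math.gamma((m + 2) / (m + 1)) / (K / (m + 1)) ** (1.0 / (m + 1))


def mttf_numeric(reliability, upper, steps=100_000):
    """Eq. (B51) by the trapezoidal rule on [0, upper]."""
    h = upper / steps
    area = 0.5 * (reliability(0.0) + reliability(upper))
    for k in range(1, steps):
        area += reliability(k * h)
    return area * h


K = 0.02  # hypothetical hazard slope
print(mttf_linear(K), mttf_weibull(K, 1))
print(mttf_numeric(lambda t: math.exp(-K * t * t / 2.0), upper=100.0))
print(mttf_constant(0.01), mttf_weibull(0.01, 0))
```

The Weibull formula with m = 0 and m = 1 should agree with Eqs. (B52) and (B53), which gives a quick sanity check on any implementation of Eq. (B54).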


9 For the derivation, see Shooman [1990, p. 197].

B4 SYSTEM RELIABILITY
B4.1 Introduction 

The previous two sections have divided reliability into two distinct phases:
a formulation of the reliability structure of the problem using combinatorial
reliability, and a computation of the element probabilities in terms of hazard
models. This section unites these two approaches to obtain the reliability
function for the system.

When the element probabilities are independent, computations are
straightforward. The only real difficulty encountered here is the complexity of the
calculations in large problems.

B4.2 The Series Configuration 

The series configuration, also called a chain structure, is the most common 
reliability model and the simplest. Any system in which the system success 
depends on the success of all its components is a series reliability configuration. 
Unfortunately for the reliability analyst (but fortunately for the user of the 
product or device), not all systems have this simple structure. 

A series configuration of n items is shown in Fig. B11(a). The reliability of
this structure is given by

$$R(t) = P(x_1, x_2, \ldots, x_n) = P(x_1)\,P(x_2 \mid x_1)\,P(x_3 \mid x_1 x_2) \cdots P(x_n \mid x_1 x_2 \cdots x_{n-1}) \tag{B55}$$

If the n items $x_1, x_2, \ldots, x_n$ are independent, then

$$R(t) = P(x_1)\,P(x_2) \cdots P(x_n) = \prod_{i=1}^{n} P(x_i) \tag{B56}$$

Figure B11 Series (a) and parallel (b) reliability configurations.

If each component exhibits a constant hazard, then the appropriate component
model is $e^{-\lambda_i t}$, and Eq. (B56) becomes

$$R(t) = \prod_{i=1}^{n} e^{-\lambda_i t} = \exp\left(-\sum_{i=1}^{n} \lambda_i t\right) \tag{B57}$$

Equation (B57) is the most commonly used and the most elementary system
reliability formula. In practice this formula is often misused (probably because
it is so simple and works well in many situations, people have become
overconfident). The following assumptions must be true if Eq. (B57) is to hold
for a system:

1. The system reliability configuration must truly be a series one.

2. The components must be independent.

3. The components must be governed by a constant-hazard model.

If assumptions 1 and 2 hold but the components have linearly increasing
hazards $z_i(t) = K_i t$, Eq. (B56) then becomes

$$R(t) = \prod_{i=1}^{n} e^{-K_i t^{2}/2} = \exp\left(-\sum_{i=1}^{n} \frac{K_i t^{2}}{2}\right) \tag{B58}$$

If p components have a constant hazard and n - p components a linearly
increasing hazard, the reliability becomes

$$R(t) = \left(\prod_{i=1}^{p} e^{-\lambda_i t}\right)\left(\prod_{i=p+1}^{n} e^{-K_i t^{2}/2}\right) = \exp\left(-\sum_{i=1}^{p} \lambda_i t\right)\exp\left(-\sum_{i=p+1}^{n} \frac{K_i t^{2}}{2}\right) \tag{B59}$$


In some cases no simple composite formula exists, and the reliability must
be expressed as a product of n terms. For example, suppose each component
is governed by the Weibull distribution, $z_i(t) = K_i t^{m_i}$. If m and K are different
for each component,

$$R(t) = \prod_{i=1}^{n} \exp\left(-\frac{K_i t^{m_i+1}}{m_i+1}\right) = \exp\left(-\sum_{i=1}^{n} \frac{K_i t^{m_i+1}}{m_i+1}\right) \tag{B60}$$

The series reliability structure serves as a lower-bound configuration. To
illustrate this principle we pose a hypothetical problem. Given a collection of
n elements, from the reliability standpoint what is the worst possible reliability
structure they can assume? The intuitive answer, of course, is a series
structure. (A proof is given in Shooman [1990, p. 205]; see also Section B6.4 of
this book.)
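
The series formulas of this section reduce to a product of component reliabilities or, equivalently, to an exponential of summed hazards. The sketch below illustrates Eqs. (B56), (B57), and (B59) under the independence assumption; the function names and the failure rates are hypothetical values chosen for the example.

```python
import math


def series_reliability(component_reliabilities):
    """Eq. (B56): product of independent component reliabilities."""
    return math.prod(component_reliabilities)


def series_constant_hazard(rates, t):
    """Eq. (B57): exp(-(sum of lambda_i) * t)."""
    return math.exp(-sum(rates) * t)


def series_mixed(rates, slopes, t):
    """Eq. (B59): constant-hazard components with rates lambda_i together with
    linearly increasing hazard components with slopes K_i."""
    return math.exp(-sum(rates) * t) * math.exp(-sum(slopes) * t * t / 2.0)


rates = [1e-4, 2e-4, 5e-5]      # hypothetical failure rates per hour
t = 1000.0
by_product = series_reliability(math.exp(-lam * t) for lam in rates)
by_sum = series_constant_hazard(rates, t)
print(by_product, by_sum)
print(series_mixed(rates, [1e-7], t))
```

The product form and the summed-exponent form agree, and the system reliability is never better than that of its least reliable component, which is why the series structure serves as a lower bound.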
B4.3 The Parallel Configuration

If a system of n elements can function properly when only one of the elements
is good, a parallel configuration is indicated. A parallel configuration of n items
is shown in Fig. B11(b). The reliability expression for a parallel system may
be expressed in terms of the probability of success of each component or, more
conveniently, in terms of the probability of failure

$$R(t) = P(x_1 + x_2 + \cdots + x_n) = 1 - P(\bar{x}_1 \bar{x}_2 \cdots \bar{x}_n) \tag{B61}$$

In the case of independent constant-hazard components, $P_f = P(\bar{x}_i) = 1 - e^{-\lambda_i t}$,
and Eq. (B61) becomes

$$R(t) = 1 - \prod_{i=1}^{n} \left(1 - e^{-\lambda_i t}\right) \tag{B62}$$

In the case of linearly increasing hazard, the expression becomes

$$R(t) = 1 - \prod_{i=1}^{n} \left(1 - e^{-K_i t^{2}/2}\right) \tag{B63}$$

In the general case, the system reliability function is

$$R(t) = 1 - \prod_{i=1}^{n} \left(1 - e^{-Z_i(t)}\right) \tag{B64}$$

where $Z_i(t) \equiv \int_0^t z_i(\xi)\,d\xi$.

In order to permit grouping of terms in Eq. (B64) to simplify computation
and/or interpretation, the equation must be expanded. The expansion results
in

$$R(t) = \left(e^{-Z_1} + e^{-Z_2} + \cdots + e^{-Z_n}\right) - \left(e^{-(Z_1+Z_2)} + e^{-(Z_1+Z_3)} + \cdots\right) + \left(e^{-(Z_1+Z_2+Z_3)} + e^{-(Z_1+Z_2+Z_4)} + \cdots\right) - \cdots + (-1)^{n+1}\, e^{-(Z_1+Z_2+Z_3+\cdots+Z_n)} \tag{B65}$$

Note that the signs of the terms in parentheses alternate and that in the first
set of parentheses, the exponents are all the Zs taken singly; in the second, all
the sums of Zs taken two at a time; and in the last term, the sum of all the Zs.
The rth parentheses in Eq. (B65) contain $n!/[r!\,(n-r)!]$ terms.

Just as the series configuration served as a lower-bound structure, the
parallel model can be thought of as an upper-bound structure.

If we have a system of n elements with information on each element
reliability but little or no information on their interconnection, we can bound the
reliability function from below by Eq. (B56) and from above by Eq. (B64).
We would in general expect these bounds to be quite loose; however, they do
provide some information even when we are grossly ignorant of the system
structure.
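
To illustrate the parallel formula, its expanded form, and the bounding idea, the sketch below computes Eq. (B64) directly, recomputes it through the alternating sums of Eq. (B65), and returns the series and parallel values as lower and upper bounds. Independence of the elements is assumed throughout, and the function names and component reliabilities are hypothetical.

```python
import math
from itertools import combinations


def parallel_reliability(component_reliabilities):
    """Eq. (B64) written with R_i = exp(-Z_i): 1 - product of (1 - R_i)."""
    return 1.0 - math.prod(1.0 - r for r in component_reliabilities)


def parallel_by_expansion(Zs):
    """Eq. (B65): alternating sums over the Z_i taken r at a time."""
    total = 0.0
    for r in range(1, len(Zs) + 1):
        sign = (-1) ** (r + 1)
        total += sign * sum(math.exp(-sum(group)) for group in combinations(Zs, r))
    return total


def structure_bounds(component_reliabilities):
    """Lower (series, Eq. B56) and upper (parallel, Eq. B64) reliability bounds."""
    rs = list(component_reliabilities)
    return math.prod(rs), parallel_reliability(rs)


rs = [0.9, 0.8, 0.95]                      # hypothetical element reliabilities
Zs = [-math.log(r) for r in rs]            # the corresponding integrated hazards
print(parallel_reliability(rs), parallel_by_expansion(Zs))
print(structure_bounds(rs))
```

For these values the bounds are 0.684 and 0.999, which shows how loose they can be when nothing is known about the interconnection.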


B4.4 An r-out-of-n Structure 

Another simple structure which serves as a useful model for many reliability
problems is an r-out-of-n structure. Such a model represents a system of n
components in which r of the n items must be good for the system to succeed.
Of course, r cannot exceed n. Two simple examples of an r-out-of-n system are
(1) a piece of stranded wire with n strands in which at least r are necessary
to pass the required current and (2) a battery composed of n series cells of E
volts each where the minimum voltage for system operation is rE. 10

We may formulate a structural model for an r-out-of-n system, but it is
simpler to use the binomial distribution if applicable. The binomial distribution
can be used only when the n components are independent and identical. If the
components differ or are dependent, the structural-model approach must be
used. 11 Success of exactly r out of n identical, independent items is given by

$$B(r\!:\!n) = \binom{n}{r} p^{r} (1 - p)^{n-r} \tag{B66}$$

where r:n stands for r out of n, and the success of at least r out of n items is
given by

$$P_s = \sum_{k=r}^{n} B(k\!:\!n) \tag{B67}$$

For constant-hazard components Eq. (B67) becomes

$$R(t) = \sum_{k=r}^{n} \binom{n}{k} e^{-k\lambda t} \left(1 - e^{-\lambda t}\right)^{n-k} \tag{B68}$$

Similarly for linearly increasing or Weibull components, the reliability
functions are

$$R(t) = \sum_{k=r}^{n} \binom{n}{k} e^{-kKt^{2}/2} \left(1 - e^{-Kt^{2}/2}\right)^{n-k} \tag{B69}$$

$$R(t) = \sum_{k=r}^{n} \binom{n}{k} e^{-kKt^{m+1}/(m+1)} \left(1 - e^{-Kt^{m+1}/(m+1)}\right)^{n-k} \tag{B70}$$

It is of interest to note that for r = 1, the structure becomes a parallel system
and for r = n the structure becomes a series system. Thus, in a sense series
and parallel systems are subclasses of an r-out-of-n structure.

10 Actually when one cell of a series of n cells fails, the voltage of the string does not become
(n - 1)E unless a special circuit arrangement is used. Such a circuit is discussed in Shooman
[1990, p. 2.9].

11 The reader should refer to the example given in Eq. (B17) and Fig. B3.
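
The binomial sum for an r-out-of-n structure is easy to evaluate directly. The sketch below, with function names and numerical values assumed for illustration, implements Eq. (B67) for identical, independent components and checks the two limiting cases mentioned above.

```python
from math import comb, exp


def r_out_of_n(r, n, p):
    """Eq. (B67): probability that at least r of n identical,
    independent components (each with success probability p) are good."""
    return sum(comb(n, k) * p ** k * (1.0 - p) ** (n - k) for k in range(r, n + 1))


def r_out_of_n_constant_hazard(r, n, lam, t):
    """Eq. (B68): the same sum with p = exp(-lambda * t)."""
    return r_out_of_n(r, n, exp(-lam * t))


p = 0.9
print(r_out_of_n(2, 3, p))                       # a 2-out-of-3 structure
print(r_out_of_n(1, 3, p), 1 - (1 - p) ** 3)     # r = 1: parallel system
print(r_out_of_n(3, 3, p), p ** 3)               # r = n: series system
print(r_out_of_n_constant_hazard(2, 3, 1e-3, 100.0))
```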


B5 ILLUSTRATIVE EXAMPLE OF SIMPLIFIED AUTO DRUM 
BRAKES 

B5.1 Introduction 

The preceding sections have attempted to summarize the pertinent aspects of 
reliability theory and show the reader how analysis can be performed. This 
section illustrates via an example how the theory can be applied. 

The example chosen for this section is actually a safety analysis. In the 
case of the automobile, the only difference between a reliability and a safety 
analysis is in the choice of subsystems included in the analysis. In the case 
of safety, we concentrate on the subsystems whose failure could cause injury 
to the occupants, other passengers, pedestrians, etc. In the case of reliability 
analysis, we would include all subsystems whose failure either makes the auto 
inoperative or necessitates a repair (depending on our definition of success). 


B5.2 The Brake System 

The example considers the braking system of a typical older auto, without 
power brakes or antilock brakes and excluding the parking (emergency) brake 
and the dash warning light. An analysis at the detailed (piece-part) level is a 
difficult task. The major subsystems in a typical braking system may contain 
several hundred parts and assemblies. 

The major subsystems and approximate parts counts are: pressure differential
valve (8 parts), self-adjusting drum brakes (4 × 15 parts), wheel cylinder
(4 × 9 parts), tubing, brackets, and connectors (50 parts), dual master cylinder (22
parts), and master cylinder installation parts (20 parts). Frequently, because of
lack of data, analysis is not carried out at this piece-part level. Even with scanty
data, an analysis is still important, since often it will show obvious weaknesses
of such a braking system which can be improved when redesigning it. In such
a case, redesign is warranted, based on engineering judgment, even without
statistics on frequencies of the failure modes. The example will be performed
at a higher level, and will group together all failure mechanisms which cause
the particular failure mode in question.

TABLE B3 A Simplified Braking System FMECA

Failure Mode | Mechanism | Safety Criticality | Comments
M1: Differential valve failure (1/2 system) | Leakage or blockage | Medium | Reduces braking efficiency
M2: Differential valve failure (total system) | Leakage affecting both front and back systems | High | Loss of both systems
M3: Master cylinder failure (1/2 system) | Leakage or blockage | Medium | Reduces braking efficiency
M4: Master cylinder failure (total system) | Leakage (front and back) | High | Loss of both systems
M5: Drum brakes, self-adjusting | Leakage or blockage of one assembly | Medium | Unbalance of brakes causes erratic behavior
M6: Tubing, brackets, and connectors (1/2 system) | Leakage or blockage | Medium | Reduces braking efficiency
M7: Pedal and linkage | Broken or jammed | High | Loss of both systems


B5.3 Failure Modes, Effects, and Criticality Analysis 

An FMECA 12 for the simplified braking system is given in Table B3.
Inspection of the table shows that the modes which most seriously affect safety are
modes M2, M4, and M7, and the design should be scrutinized in a design
review to assure that the probability of occurrence for these modes is
minimized.


B5.4 Structural Model 

The next step in analysis would be the construction of an SBD or an FT. 13
Assume that a safety failure occurs if any one of modes M2, M4, or M7 occurs
singly or if any two of modes M1, M3, M5, and M6 occur together. (Actually, the
paired modes must affect both front and rear systems to constitute a failure; but
approximations are made here to simplify the analysis.) Based on the above
assumptions, the SBD and FT given in Figs. B12 and B13, respectively, are obtained.

Figure B12 Safety block diagram for simplified brake example. The notation $x_i$
means a failure mode for element i does not occur, indicating success of element i.

12 Sometimes a column is added to an FMEA analysis which discusses (evaluates) the severity
or criticality of the failure mode. In such a case, the analysis is called an FMECA.

13 Safety block diagram or fault tree.


B5.5 Probability Equations 

Given either the SBD of Fig. B12 or the FT of Fig. B13, an equation can be
written for the probability of safety (or unsafety), using probability principles.
Computer programs are used in complex cases; this simplified case can be
written:

$$P_s = \text{probability of a safe condition from Fig. B12} = P[(x_2 x_4 x_7)(x_1 x_3 x_5 + x_1 x_3 x_6 + x_1 x_5 x_6 + x_3 x_5 x_6)]$$

$$P_u = \text{probability of an unsafe condition from Fig. B13} = P[(\bar{x}_2 + \bar{x}_4 + \bar{x}_7) + (\bar{x}_1 \bar{x}_3 + \bar{x}_1 \bar{x}_5 + \bar{x}_1 \bar{x}_6 + \bar{x}_3 \bar{x}_5 + \bar{x}_3 \bar{x}_6 + \bar{x}_5 \bar{x}_6)]$$

An analysis in more depth would require more detail (and more input data).
The choice of how much decomposition to lower levels of detail is required in
an analysis is often determined by data availability. To continue the analysis,
failure data on the modes M1, M2, ..., M6 is required. If, for M3, the failure
rate were constant and equal to $\lambda_3$ failures per mile, then the probability of
mode M3 occurring or not occurring in M miles would be:

$$P(x_3) = e^{-\lambda_3 M} \quad \text{(mode M3 does not occur in } M \text{ miles)}$$

$$P(\bar{x}_3) = 1 - e^{-\lambda_3 M} \quad \text{(mode M3 does occur in } M \text{ miles)}$$

To complete the analysis, the failure-rate data $\lambda_i$ are substituted into the
equations to determine $P(x_i)$ or $P(\bar{x}_i)$; then these terms are substituted into the
equations for $P_s$ or $P_u$; and last, $P_s$ or $P_u$ is substituted into a system safety
equation along with probabilities for all subsystems that affect safety. For more
advanced FT concepts, see Dugan [1996].
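
The substitution steps just described can be sketched in a few lines. The example below assumes that the seven failure modes are independent and that each has a constant failure rate per mile; the rates, the mileage, and all function names are hypothetical and serve only to show the mechanics. The safe probability is computed from the structure of the P_s expression, and the unsafe probability is obtained independently by enumerating all mode combinations, so that their sum can be checked against 1.

```python
import math
from itertools import product

SINGLE_MODES = (2, 4, 7)      # any one of these occurring is unsafe
PAIRED_MODES = (1, 3, 5, 6)   # any two of these occurring together is unsafe


def prob_mode_absent(rate_per_mile, miles):
    """P(x_i) = exp(-lambda_i * M) for a constant failure rate."""
    return math.exp(-rate_per_mile * miles)


def prob_safe(p):
    """P_s for independent modes, where p[i] = P(x_i)."""
    singles_ok = math.prod(p[i] for i in SINGLE_MODES)
    none_occur = math.prod(p[i] for i in PAIRED_MODES)
    exactly_one = sum(
        (1.0 - p[i]) * math.prod(p[j] for j in PAIRED_MODES if j != i)
        for i in PAIRED_MODES
    )
    return singles_ok * (none_occur + exactly_one)


def prob_unsafe_by_enumeration(p):
    """P_u summed over every combination of modes that is unsafe."""
    modes = sorted(p)
    total = 0.0
    for state in product((True, False), repeat=len(modes)):
        absent = dict(zip(modes, state))   # True means the mode does not occur
        weight = math.prod(p[i] if absent[i] else 1.0 - p[i] for i in modes)
        single_hit = any(not absent[i] for i in SINGLE_MODES)
        pair_hit = sum(not absent[i] for i in PAIRED_MODES) >= 2
        if single_hit or pair_hit:
            total += weight
    return total


# Hypothetical failure rates (per mile) and mileage, for illustration only.
rates = {1: 2e-6, 2: 1e-7, 3: 2e-6, 4: 1e-7, 5: 5e-6, 6: 3e-6, 7: 5e-8}
miles = 50_000
p = {i: prob_mode_absent(lam, miles) for i, lam in rates.items()}
p_s = prob_safe(p)
p_u = prob_unsafe_by_enumeration(p)
print(p_s, p_u, p_s + p_u)
```

The sum of the four three-term products in the P_s expression is the event that at least three of the four paired elements are good, which is why the code adds the "none occur" and "exactly one occurs" probabilities.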


B5.6 Summary 

The safety analysis consists of:

1. Decomposing the system into subsystems or piece-parts.

2. Drawing a safety block diagram (SBD) or fault tree (FT) (computer
programs are available for this purpose). See Appendix D.

3. Computing the probability of safety or unsafety from the SBD or
FT (computer programs are also available for this purpose).

4. Determining the failure rates of each component element. This is a data
collection and estimation problem.

5. Substituting the failure rates into the expression of step 3 (also done by
computer programs).

B6 MARKOV RELIABILITY AND AVAILABILITY MODELS 
B6.1 Introduction 

Dependent failures, repair, or standby operation complicates the direct
calculation of element reliabilities. In this section we shall discuss three different
approaches to reliability computations for systems involving such complications.
The first technique is the use of Markov models, which works well and
has much appeal as long as the failure hazards z(t) and repair hazards w(t) are
constant. When z(t) and w(t) become time-dependent, the method breaks down,
except in a few special cases. (See Shooman [1990, pp. 348-359].) A second
method, using joint density functions, and a third method, using convolution-like
integrations, are more difficult to set up, but they are still valid when z(t)
or w(t) is time-dependent. 14

Some of the Markov modeling programs discussed in Appendix D deal with
nonconstant hazards. In many cases, however, there is a paucity of failure-rate data,
and constant failure rates and constant repair rates are used by default.

B6.2 Markov Models 

The basic properties of Markov models have already been discussed in Section
A8. In this section we shall briefly review some of the assumptions necessary
for formulation of a Markov model and show how it can be used to make
reliability computations.

14 See Shooman [1990, Section 5.8].

In order to formulate a Markov model (to be more precise, we are talking
about continuous-time, discrete-state models) we must first define all the
mutually exclusive states of the system. For example, in a system composed
of a single nonrepairable element $x_1$ there are two possible states: $s_0 = x_1$, in
which the element is good, and $s_1 = \bar{x}_1$, in which the element is bad. The states
of the system at t = 0 are called the initial states, and those representing a final
or equilibrium state are called final states. The set of Markov state equations
describes the probabilistic transitions from the initial to the final states.

The transition probabilities must obey the following two rules: 

1. The probability of transition in time $\Delta t$ from one state to another is given
by $z(t)\Delta t$, where z(t) is the hazard associated with the two states in question.
If all the $z_i(t)$'s are constant, $z_i(t) = \lambda_i$, and the model is called
homogeneous. If any hazards are time functions, the model is called
nonhomogeneous.

2. The probabilities of more than one transition in time $\Delta t$ are infinitesimals
of a higher order and can be neglected.

For the example under discussion the state-transition equations can be
formulated using the above rules. The probability of being in state $s_0$ at time
$t + \Delta t$ is written $P_{s_0}(t + \Delta t)$. This is given by the probability that the system
is in state $s_0$ at time t, $P_{s_0}(t)$, times the probability of no failure in time $\Delta t$,
$1 - z(t)\Delta t$, plus the probability of being in state $s_1$ at time t, $P_{s_1}(t)$, times the
probability of repair in time $\Delta t$, which equals zero.

The resulting equation is 

$$P_{s_0}(t + \Delta t) = [1 - z(t)\Delta t]P_{s_0}(t) + 0 \cdot P_{s_1}(t) \tag{B71}$$

Similarly, the probability of being in state $s_1$ at $t + \Delta t$ is given by

$$P_{s_1}(t + \Delta t) = [z(t)\Delta t]P_{s_0}(t) + 1 \cdot P_{s_1}(t) \tag{B72}$$

The transition probability $z(t)\Delta t$ is the probability of failure (change from state
$s_0$ to $s_1$), and the probability of remaining in state $s_1$ is unity. 15 One can
summarize the transition Eqs. (B71) and (B72) by writing the transition matrix given
in Table B4. Note that it is a property of transition matrices that their rows must
sum to unity. Rearrangement of Eqs. (B71) and (B72) yields


15 Conventionally, state $s_1$ would be called an absorbing state since transitions out of the state are
not permitted.

TABLE B4 State Transition Matrix for a Single Element

                       Final States
Initial States    $s_0$                   $s_1$
$s_0$             $1 - z(t)\Delta t$      $z(t)\Delta t$
$s_1$             $0$                     $1$


$$\frac{P_{s_0}(t + \Delta t) - P_{s_0}(t)}{\Delta t} = -z(t)P_{s_0}(t)$$

$$\frac{P_{s_1}(t + \Delta t) - P_{s_1}(t)}{\Delta t} = z(t)P_{s_0}(t)$$


Passing to a limit as At becomes small, we obtain 

$$\frac{dP_{s_0}(t)}{dt} + z(t)P_{s_0}(t) = 0 \tag{B73}$$

$$\frac{dP_{s_1}(t)}{dt} = z(t)P_{s_0}(t) \tag{B74}$$


Equations (B73) and (B74) can be solved in conjunction with the appropriate
initial conditions for $P_{s_0}(t)$ and $P_{s_1}(t)$, the probabilities of ending up in state $s_0$
or state $s_1$, respectively. The most common initial condition is that the system
is good at t = 0, that is, $P_{s_0}(t = 0) = 1$ and $P_{s_1}(t = 0) = 0$. Equations (B73) and
(B74) are simple first-order linear differential equations which are easily solved
by classical theory. Equation (B73) is homogeneous (no driving function), and
separation of variables yields


$$\frac{dP_{s_0}(t)}{P_{s_0}(t)} = -z(t)\,dt$$

$$\ln P_{s_0}(t) = -\int_0^t z(\xi)\,d\xi + C_1$$

$$P_{s_0}(t) = \exp\left[-\int_0^t z(\xi)\,d\xi + C_1\right] = C_2 \exp\left[-\int_0^t z(\xi)\,d\xi\right] \tag{B75}$$


Inserting the initial condition $P_{s_0}(t = 0) = 1$,

$$P_{s_0}(t = 0) = 1 = C_2 e^{-0}$$

$$C_2 = 1$$

and one obtains the familiar reliability function

$$R(t) = P_{s_0}(t) = \exp\left[-\int_0^t z(\xi)\,d\xi\right] \tag{B76}$$

Formal solution of Eq. (B74) proceeds in a similar manner.

$$P_{s_1}(t) = 1 - \exp\left[-\int_0^t z(\xi)\,d\xi\right] \tag{B77}$$


Of course a formal solution of Eq. (B74) is not necessary to obtain Eq.
(B77), since it is possible to recognize at the outset that

$$P_{s_0}(t) + P_{s_1}(t) = 1$$

The role played by the initial conditions is clearly evident from Eq. (B75).
Since $C_2 = P_{s_0}(0)$, if the system was initially bad, $P_{s_0}(0) = 0$, and $R(t) = 0$. If
there is a fifty-fifty chance that the system is good at t = 0, then $P_{s_0}(0) = \frac{1}{2}$,
and

$$R(t) = \frac{1}{2}\exp\left[-\int_0^t z(\xi)\,d\xi\right]$$

This method of computing the system reliability function yields the same 
results, of course, as the techniques of Sections B3 to B5. Even in a single-ele¬ 
ment problem it generates a more general model. The initial condition allows 
one to include the probability of initial failure before the system in question 
is energized. 
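As a numerical cross-check of Eqs. (B73) and (B76), the sketch below integrates the state equation for the good state with a classical fourth-order Runge-Kutta step and compares the result with the closed-form exponential. The linearly increasing hazard z(t) = kt, the value of k, and all function names are assumptions made only for this illustration.

```python
import math

K = 2e-6  # assumed hazard slope, so that z(t) = K * t (illustrative only)

def hazard(t: float) -> float:
    """Assumed time-dependent hazard z(t)."""
    return K * t

def integrate_good_state(t_end: float, p0: float = 1.0, steps: int = 10_000) -> float:
    """Integrate dP_s0/dt = -z(t) * P_s0 (Eq. B73) from 0 to t_end with RK4."""
    def f(t: float, p: float) -> float:
        return -hazard(t) * p

    h = t_end / steps
    p, t = p0, 0.0
    for _ in range(steps):
        k1 = f(t, p)
        k2 = f(t + h / 2, p + h * k1 / 2)
        k3 = f(t + h / 2, p + h * k2 / 2)
        k4 = f(t + h, p + h * k3)
        p += h * (k1 + 2 * k2 + 2 * k3 + k4) / 6
        t += h
    return p

t_end = 1000.0
numeric = integrate_good_state(t_end)
closed_form = math.exp(-K * t_end**2 / 2)          # Eq. (B76) with z(t) = K*t
half_start = integrate_good_state(t_end, p0=0.5)   # fifty-fifty initial condition
print(numeric, closed_form, half_start)
```

Starting from an initial probability of one-half simply scales the whole curve by one-half, which is the role the constant $C_2$ plays in Eq. (B75).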

B6.3 Markov Graphs 

It is often easier to characterize Markov models by a graph composed of nodes
representing system states and branches labeled with transition probabilities.
Such a Markov graph for the problem described by Eqs. (B73) and (B74) or
Table B4 is given in Fig. B14. Note that the sum of transition probabilities
for the branches leaving each node must be unity. Treating the nodes as signal
sources and the transition probabilities as transmission coefficients, we can
write Eqs. (B73) and (B74) by inspection.

Figure B14 Markov graph for a single nonrepairable element. (Node $s_0$ has a self-loop of gain $1 - z(t)\Delta t$ and a branch of gain $z(t)\Delta t$ to node $s_1$; node $s_1$ has a self-loop of gain 1.)

Thus, the probability of being at any
node at time $t + \Delta t$ is the sum of all signals arriving at that node. All other
nodes are considered probability sources at time t, and all transition probabilities
serve as transmission gains. A simple algorithm for writing Eqs. (B73) and
(B74) by inspection is to equate the derivative of the probability at any node
to the sum of the transmissions coming into the node. Any unity gain factors
of the self-loops must first be set to zero, and the $\Delta t$ factors are dropped from
the branch gains. Referring to Fig. B14, the self-loop on $P_{s_1}$ disappears, and
the equation becomes $\dot{P}_{s_1} = zP_{s_0}$. At node $P_{s_0}$ the self-loop gain becomes $-z$,
and the equation is $\dot{P}_{s_0} = -zP_{s_0}$. The same algorithm holds at each node for
more complex graphs.
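The inspection rule can be expressed as a short routine. In the sketch below, a Markov graph is described by a dictionary of branch hazards keyed by (source, destination) state indices; the function name `state_derivatives` and this data layout are assumptions made for illustration.

```python
def state_derivatives(rates: dict[tuple[int, int], float], probs: list[float]) -> list[float]:
    """Return dP/dt for every node of a Markov graph.

    Each branch (src, dst) with hazard z adds z*P_src to the derivative at dst
    (an incoming transmission) and subtracts z*P_src at src (the self-loop gain
    1 - z*dt with the unity term set to zero and the dt factor dropped).
    """
    derivs = [0.0] * len(probs)
    for (src, dst), z in rates.items():
        derivs[dst] += z * probs[src]
        derivs[src] -= z * probs[src]
    return derivs

# Single nonrepairable element of Fig. B14 with an assumed constant hazard.
single = state_derivatives({(0, 1): 0.001}, [1.0, 0.0])
print(single)  # derivative at s0 is -z*P_s0, derivative at s1 is +z*P_s0
```

Because every branch contributes equal and opposite terms, the derivatives always sum to zero, which is the differential form of the requirement that the state probabilities sum to unity.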


B6.4 Example —A Two-Element Model 

One can illustrate dependent failures, 16 standby operation, and repair by
discussing a two-element system. For simplicity repair is ignored at first. If a
two-element system consisting of elements $x_1$ and $x_2$ is considered, there are
four system states: $s_0 = x_1 x_2$, $s_1 = \bar{x}_1 x_2$, $s_2 = x_1 \bar{x}_2$, and $s_3 = \bar{x}_1 \bar{x}_2$. The state
transition matrix is given in Table B5 and the Markov graph in Fig. B15.

The probability expressions for these equations can be written by inspection,
using the algorithms previously stated.


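$$\frac{dP_{s_0}(t)}{dt} = -[z_{01}(t) + z_{02}(t)]P_{s_0}(t) \tag{B78}$$

$$\frac{dP_{s_1}(t)}{dt} = -[z_{13}(t)]P_{s_1}(t) + [z_{01}(t)]P_{s_0}(t) \tag{B79}$$

$$\frac{dP_{s_2}(t)}{dt} = -[z_{23}(t)]P_{s_2}(t) + [z_{02}(t)]P_{s_0}(t) \tag{B80}$$

$$\frac{dP_{s_3}(t)}{dt} = [z_{13}(t)]P_{s_1}(t) + [z_{23}(t)]P_{s_2}(t) \tag{B81}$$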


16 For dependent failures, see Shooman [1990, p. 235].

TABLE B5 State Transition Matrix for Two Elements

                                                        Final States
Initial States          $s_0$                                   $s_1$                      $s_2$                      $s_3$
Zero failures  $s_0$    $1 - [z_{01}(t) + z_{02}(t)]\Delta t$   $z_{01}(t)\Delta t$        $z_{02}(t)\Delta t$        $0$
One failure    $s_1$    $0$                                     $1 - [z_{13}(t)]\Delta t$  $0$                        $z_{13}(t)\Delta t$
               $s_2$    $0$                                     $0$                        $1 - [z_{23}(t)]\Delta t$  $z_{23}(t)\Delta t$
Two failures   $s_3$    $0$                                     $0$                        $0$                        $1$


The initial conditions associated with this set of equations are $P_{s_0}(0)$, $P_{s_1}(0)$,
$P_{s_2}(0)$, and $P_{s_3}(0)$.

It is difficult to solve these equations for a general hazard function z(t), but
if the hazards are specified, the solution is quite simple. If all the hazards are
constant, with $z_{01}(t) = \lambda_1$, $z_{02}(t) = \lambda_2$, $z_{13}(t) = \lambda_3$, and $z_{23}(t) = \lambda_4$, the solutions are

$$P_{s_0}(t) = e^{-(\lambda_1 + \lambda_2)t} \tag{B82}$$

$$P_{s_1}(t) = \frac{\lambda_1}{\lambda_1 + \lambda_2 - \lambda_3}\left(e^{-\lambda_3 t} - e^{-(\lambda_1 + \lambda_2)t}\right) \tag{B83}$$

$$P_{s_2}(t) = \frac{\lambda_2}{\lambda_1 + \lambda_2 - \lambda_4}\left(e^{-\lambda_4 t} - e^{-(\lambda_1 + \lambda_2)t}\right) \tag{B84}$$

$$P_{s_3}(t) = 1 - [P_{s_0}(t) + P_{s_1}(t) + P_{s_2}(t)] \tag{B85}$$

where

$$P_{s_0}(0) = 1 \quad \text{and} \quad P_{s_1}(0) = P_{s_2}(0) = P_{s_3}(0) = 0$$

Note that we have not as yet had to say anything about the configuration of
the system, but only have had to specify the number of elements and the
transition probabilities. Thus, when we solve for $P_{s_0}$, $P_{s_1}$, $P_{s_2}$, we have essentially
solved for all possible two-element system configurations. In a two-element
system, formulation of the reliability expressions in terms of $P_{s_0}$, $P_{s_1}$, and $P_{s_2}$
is trivial, but in a more complex problem we can always formulate the
expression using the tools of Sections B3 to B5.

For a series system, the only state representing success is no failures; that
is, $P_{s_0}(t)$. Therefore

$$R(t) = P_{s_0}(t) = e^{-(\lambda_1 + \lambda_2)t} \tag{B86}$$


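If the two elements are in parallel, one failure can be tolerated, and there are
three successful states, $P_{s_0}(t)$, $P_{s_1}(t)$, $P_{s_2}(t)$. Since the states are mutually
exclusive,

$$R(t) = P_{s_0}(t) + P_{s_1}(t) + P_{s_2}(t) = e^{-(\lambda_1 + \lambda_2)t} + \frac{\lambda_1}{\lambda_1 + \lambda_2 - \lambda_3}\left(e^{-\lambda_3 t} - e^{-(\lambda_1 + \lambda_2)t}\right) + \frac{\lambda_2}{\lambda_1 + \lambda_2 - \lambda_4}\left(e^{-\lambda_4 t} - e^{-(\lambda_1 + \lambda_2)t}\right) \tag{B87}$$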


It is easy to see why a series configuration of n components has the poorest
reliability and why a parallel configuration has the best. The only successful
state for a series system is where all components are good; thus, $R(t) = P_{s_0}(t)$.
In the case of a parallel system, all states except the one in which all components
have failed are good, and $R(t) = P_{s_0}(t) + P_{s_1}(t) + P_{s_2}(t)$. It is clear that
any other system configuration falls somewhere in between.
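The closed-form solutions of Eqs. (B82) to (B85) can be checked against a direct numerical integration of Eqs. (B78) to (B81). The following sketch does this with a hand-written Runge-Kutta integrator; the four hazard values are hypothetical and the function names are assumptions for the example.

```python
import math

# Hypothetical constant hazards (failures per hour) for z01, z02, z13, z23.
L1, L2, L3, L4 = 1e-3, 2e-3, 4e-3, 5e-3

def closed_form(t: float) -> list[float]:
    """State probabilities from Eqs. (B82) to (B85)."""
    a = L1 + L2
    p0 = math.exp(-a * t)
    p1 = L1 / (a - L3) * (math.exp(-L3 * t) - p0)
    p2 = L2 / (a - L4) * (math.exp(-L4 * t) - p0)
    return [p0, p1, p2, 1.0 - (p0 + p1 + p2)]

def derivs(p: list[float]) -> list[float]:
    """Right-hand sides of Eqs. (B78) to (B81) for constant hazards."""
    return [-(L1 + L2) * p[0],
            -L3 * p[1] + L1 * p[0],
            -L4 * p[2] + L2 * p[0],
            L3 * p[1] + L4 * p[2]]

def rk4(t_end: float, steps: int = 2000) -> list[float]:
    """Integrate from the all-good initial state with classical RK4."""
    p = [1.0, 0.0, 0.0, 0.0]
    h = t_end / steps
    for _ in range(steps):
        k1 = derivs(p)
        k2 = derivs([x + h / 2 * k for x, k in zip(p, k1)])
        k3 = derivs([x + h / 2 * k for x, k in zip(p, k2)])
        k4 = derivs([x + h * k for x, k in zip(p, k3)])
        p = [x + h / 6 * (a + 2 * b + 2 * c + d)
             for x, a, b, c, d in zip(p, k1, k2, k3, k4)]
    return p

analytic = closed_form(500.0)
numeric = rk4(500.0)
r_series = analytic[0]                                 # Eq. (B86)
r_parallel = analytic[0] + analytic[1] + analytic[2]   # Eq. (B87)
print(r_series, r_parallel)
```

With these assumed hazards the parallel reliability exceeds the series reliability at every positive time, in line with the bounding argument above.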


B6.5 Model Complexity 

The complexity of a Markov model depends on the number of system states. In
general we obtain for an m-state problem a system of m first-order differential
equations. The number of states is given in terms of the number of components
n as

$$m = 2^n$$

Thus, our two-element model has 4 states, and a four-element model 16 states.
This means that an n-component system may require a solution of as many as
$2^n$ first-order differential equations. In many cases we are interested in fewer
states. Suppose we want to know only how many failed items are present in
each state and not which items have failed. This would mean a model with n
+ 1 states rather than $2^n$, which represents a tremendous saving. To illustrate
how such simplifications affect the Markov graph we consider the collapsed
flowgraph shown in Fig. B16 for the example given in Fig. B15. Collapsing
the flowgraph is equivalent to the restriction $P'_{s_1}(t) = P_{s_1}(t) + P_{s_2}(t)$ applied to
Eqs. (B78) to (B81). Note that one can collapse the flowgraph only if $z_{13} = z_{23}$;
however, $z_{01}$ and $z_{02}$ need not be equal. These results are obvious if Eqs. (B79)
and (B80) are added.
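The growth in the number of states, and the saving obtained by tracking only the number of failed items, can be illustrated by enumeration. In this sketch each state is a tuple of 0/1 flags (1 meaning failed); the representation and function names are assumptions for the example.

```python
from collections import Counter
from itertools import product

def enumerate_states(n: int) -> list[tuple[int, ...]]:
    """All 2**n good/failed combinations for n distinct elements (1 = failed)."""
    return list(product((0, 1), repeat=n))

def collapse_by_failure_count(n: int) -> Counter:
    """Group the full state set by number of failed elements (n + 1 groups)."""
    return Counter(sum(state) for state in enumerate_states(n))

states4 = enumerate_states(4)
collapsed4 = collapse_by_failure_count(4)
print(len(states4), "full states;", len(collapsed4), "collapsed states")
print(dict(collapsed4))  # how many full states merge into each collapsed state
```

The counts in each collapsed group are the binomial coefficients, which shows how many distinct states are merged when only the number of failures is tracked.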

Markov graphs for a system with repair are shown in Fig. B17(a) and (b).
The graph in Fig. B17(a) is a general model, and that of Fig. B17(b) is a
collapsed model.

The system equations can be written for Fig. B17(a) by inspection using the
algorithm previously discussed.


Figure B16 Collapsed Markov graph corresponding to Fig. B15, where $z'_{01}(t) = z_{01}(t) + z_{02}(t)$ and $z'_{12}(t) = z_{13}(t) = z_{23}(t)$.

Figure B17 Markov graphs for a system with repair: (a) general model; (b) collapsed
model.

Figure B18 Markov graph for the reliability of a single component with repair.


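$$\dot{P}_{s_0} = -(z_{01} + z_{02})P_{s_0} + w_{10}P_{s_1} + w_{20}P_{s_2}$$
$$\dot{P}_{s_1} = -(z_{13} + w_{10})P_{s_1} + z_{01}P_{s_0}$$
$$\dot{P}_{s_2} = -(z_{23} + w_{20})P_{s_2} + z_{02}P_{s_0}$$
$$\dot{P}_{s_3} = z_{13}P_{s_1} + z_{23}P_{s_2} \tag{B88}$$

Similarly for Fig. B17(b)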



(B89) 


The solution to Eqs. (B88) and (B89) for various values of the zs and ws will 
be deferred until the next section. 


B7 REPAIRABLE SYSTEMS 
B7.1 Introduction 

In general, whenever the average repair cost in time and money of a piece 
of equipment is a fraction of the initial equipment cost, one considers system 
repair. If such a system can be rapidly returned to service, the effect of the fail¬ 
ure is minimized. Obvious examples are such equipment as a television set, 
an automobile, or a radar installation. In such a system the time between fail¬ 
ures, repair time, number of failures in an interval, and percentage of operating 
time in an interval are figures of merit which must be considered along with 
the system reliability. Of course, in some systems, such as those involving life 
support, surveillance, or safety, any failure is probably catastrophic, and repair 
is of no avail. 
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B7.2 Availability Function 

In order to describe the beneficial features of repair in a system that tolerates
shutdown times, a new system function called availability is introduced. The
availability function A(t) is defined as the probability that the system is
operating at time t. By contrast, the reliability function R(t) is the probability that
the system has operated over the interval 0 to t. Thus, if A(250) = 0.95, then
if 100 such systems are operated for 250 hours, on the average 95 will be
operative when 250 hours is reached and 5 will be undergoing various stages
of repairs. The availability function contains no information on how many (if
any) failure-repair cycles have occurred prior to 250 hours. On the other hand,
if R(250) = 0.95, then if 100 such systems are operated for 250 hours, on the
average 95 will have operated without failure for 250 hours and 5 will have
failed at some time within this interval. It is immaterial in which stage of the
first or subsequent failure-repair cycles the five failed systems are. Obviously
the requirement that R(250) = 0.95 is much more stringent than the requirement
that A(250) = 0.95. Thus, in general, $R(t) \le A(t)$.

If a single unit has no repair capability, then by definition A(t) = R(t). If we
allow repair, then R(t) does not change, but A(t) becomes greater than R(t). 
The same conclusions hold for a chain structure. The situation changes for any 
system involving more than one tie set, i.e., systems with inherent or purposely 
introduced redundancy. In such a case, repair can beneficially alter both the 
R(t) and A(t) functions. This is best illustrated by a simple system composed 
of two parallel units. If a system consists of components A and B in parallel 
and no repairs are permitted, the system fails when both A and B have failed. 
In a repairable system if A fails, unit B continues to operate, and the system 
survives. Meanwhile, a repairer begins repair of unit A. If the repairer restores 
A to usefulness before B fails, the system continues to operate. The second 
component failure might be unit B, or unit A might fail the second time in a 
row. In either case there is no system failure as long as the repair time is shorter 
than the time between failures. In the long run, at some time a lengthy repair 
will be started and will be in progress when the alternate unit fails, causing 
system failure. It is clear that repair will improve system reliability in such a 
system. It seems intuitive that the increase in reliability will be a function of 
the mean time to repair divided by the MTTF. 

To summarize, in a series system, repair will not affect the reliability expres¬ 
sion; however, for a complete description of system operation we shall have 
to include measures of repair time and time between failures. If the system 
structure has any parallel paths, repair will improve reliability, and repair time 
and time between failures will be of importance. In some systems, e.g., an 
unmanned space vehicle, repair may be impossible or impractical. 17 


17 Technology is rapidly reaching the point where repair of an orbiting space vehicle is practical.
The Hubble Space Telescope has already been repaired twice.

Figure B19 Markov graph for the availability of a single component with repair.


B7.3 Reliability and Availability of Repairable Systems 

As long as the failure and repair density functions are exponential, i.e., 
constant-hazard, we can structure Markov repair models, as done in the pre¬ 
vious section. The reliability and availability models will differ, and we must 
exercise great care in assigning absorbing states in a reliability model for a 
repairable system. 

The reliability of a single component $x_1$ with constant failure hazard $\lambda$ and
constant repair hazard $\mu$ can be derived easily using a Markov model. The
Markov graph is given in Fig. B18 and the differential equations and reliability
function in Eqs. (B90) and (B91).

$$\dot{P}_{s_0} + \lambda P_{s_0} = 0 \qquad \dot{P}_{s_1} = \lambda P_{s_0} \tag{B90}$$

$$P_{s_0}(0) = 1 \qquad P_{s_1}(0) = 0$$

$$R(t) = P_{s_0}(t) = 1 - P_{s_1}(t) = e^{-\lambda t} \tag{B91}$$

Note that repair in no way influenced the reliability computation. Element
failure $\bar{x}_1$ is an absorbing state, and once it is reached, the system never returns
to $x_1$.

If we wish to study the availability, we must make a different Markov graph.
State $\bar{x}_1$ is no longer an absorbing state, since we now allow transitions from
state $\bar{x}_1$ back to state $x_1$. The Markov graph is given in Fig. B19 and the
differential equations and state probabilities in Eqs. (B92) and (B93). The
corresponding differential equations are

$$\dot{P}_{s_0} + \lambda P_{s_0} = \mu P_{s_1} \qquad \dot{P}_{s_1} + \mu P_{s_1} = \lambda P_{s_0}$$

$$P_{s_0}(0) = 1 \qquad P_{s_1}(0) = 0 \tag{B92}$$
Solution yields the probabilities

$$P_{s_0}(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}e^{-(\lambda + \mu)t}$$

$$P_{s_1}(t) = \frac{\lambda}{\lambda + \mu} - \frac{\lambda}{\lambda + \mu}e^{-(\lambda + \mu)t} \tag{B93}$$

By definition, the availability is the probability that the system is good, $P_{s_0}(t)$:

$$A(t) = P_{s_0}(t) = \frac{\mu}{\lambda + \mu} + \frac{\lambda}{\lambda + \mu}e^{-(\lambda + \mu)t} \tag{B94}$$

The availability function given in Eq. (B94) is plotted in Fig. B20.
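A short numerical sketch of the single-component availability function follows. It evaluates the closed form and confirms by a central finite difference that it satisfies the first state equation of the availability model. The failure and repair rates are hypothetical, and the function name `availability` is an assumption for the example.

```python
import math

def availability(t: float, lam: float, mu: float) -> float:
    """Single-component availability for constant failure rate lam and repair rate mu."""
    total = lam + mu
    return mu / total + lam / total * math.exp(-total * t)

lam, mu = 1e-3, 1e-1   # hypothetical rates per hour
t, h = 50.0, 1e-3

a_t = availability(t, lam, mu)
# Central-difference estimate of dA/dt at time t.
slope = (availability(t + h, lam, mu) - availability(t - h, lam, mu)) / (2 * h)
# State equation: dP_s0/dt = -lam * P_s0 + mu * P_s1, with P_s1 = 1 - P_s0.
rhs = -lam * a_t + mu * (1.0 - a_t)
print(a_t, slope, rhs)
```

The function starts at unity (the component is assumed good at t = 0) and decays toward the ratio of the repair rate to the sum of the two rates.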


B7.4 Steady-State Availability 

An important difference between A(t) and R(t) is their steady-state behavior.
As t becomes large, all reliability functions approach zero, whereas availability
functions reach some steady-state value. For the single component the steady-state
availability is

$$A_{ss} = \lim_{t \to \infty} A(t) = \frac{\mu}{\lambda + \mu} \tag{B95a}$$

In the normal case, the mean repair time $1/\mu$ is much smaller than the time
to failure $1/\lambda$, and we can expand the steady-state availability in a series and
approximate by truncation:

$$A_{ss} = A(\infty) = \frac{1}{1 + \lambda/\mu} = 1 - \frac{\lambda}{\mu} + \frac{\lambda^2}{\mu^2} - \cdots \approx 1 - \frac{\lambda}{\mu} \tag{B95b}$$

The transient part of the availability function decays to zero fairly rapidly.
The time at which the transient term is negligible with respect to the steady-state
term depends on $\lambda$ and $\mu$. As an upper bound we know that the term
$e^{-at} < 0.02$ for $t > 4/a$; therefore, we can state that the transient term is over
before $t = 4/(\lambda + \mu)$. If $\mu > \lambda$, the transient is over before $t = 4/\mu$. The
interaction between reliability and availability specifications is easily seen in
the following example.

Suppose a system is to be designed to have a reliability of greater than 0.90
over 1,000 hours and a minimum availability of 0.99 over that period. The
reliability specification yields

$$R(t) = e^{-\lambda t} > 0.90 \qquad 0 < t < 1{,}000$$

$$e^{-1{,}000\lambda} \approx 1 - 10^3\lambda > 0.90 \qquad \lambda < 10^{-4}$$

Using $A(\infty)$ as the minimum value of the availability, Eq. (B95b) yields

$$A(\infty) = 1 - \frac{\lambda}{\mu} = 0.99$$

$$\mu = 100\lambda = 10^{-2}$$

Thus, we use a component with an MTTF of $10^4$ hours, a little over 1 year, and
a mean repair time of 100 hours (about 4 days). The probability of any failure
within 1,000 hours (about 6 weeks) is less than 10%. Furthermore, the
probability that the system is down and under repair at any chosen time between
t = 0 and t = $10^3$ hours is less than 1%. Now to check the approximations.
The transient phase of the availability function lasts for $4/(10^{-2} + 10^{-4}) \approx 400$
hours; thus the availability will be somewhat greater than 0.99 for 400 hours
and then settle down at 0.99 for the remaining 600 hours. Since $\mu$ is $100\lambda$, the
approximation of Eq. (B95b) is valid. Also since $\lambda t = 10^{-4} \times 10^3 = 10^{-1}$, the
two-term series expansion of the exponential is also satisfactory.
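The numbers in this design example are easy to reproduce. The sketch below evaluates the exact reliability at 1,000 hours, the exact and truncated steady-state availability, and the transient duration, using the limiting failure rate of $10^{-4}$ per hour and a repair rate one hundred times larger. Variable names are assumptions for the example.

```python
import math

lam = 1e-4          # failure rate at the design limit, per hour
mu = 100 * lam      # repair rate, per hour

r_1000 = math.exp(-lam * 1000)          # exact reliability at 1,000 hours
r_1000_two_term = 1 - lam * 1000        # two-term series approximation
a_ss_exact = mu / (lam + mu)            # exact steady-state availability
a_ss_approx = 1 - lam / mu              # truncated series approximation
transient_hours = 4 / (lam + mu)        # time for the transient to die out

print(f"R(1000)   exact {r_1000:.4f}, two-term {r_1000_two_term:.4f}")
print(f"A(inf)    exact {a_ss_exact:.6f}, truncated {a_ss_approx:.6f}")
print(f"transient about {transient_hours:.0f} hours")
```

The exact reliability at 1,000 hours is slightly above the two-term estimate of 0.90, and the exact steady-state availability is slightly above the truncated value of 0.99, so both approximations err on the conservative side here.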

The availability function has been defined as a probability function, just as
the reliability function was. There is another statistical interpretation which
sheds some light on the concept. Suppose that a large number of system
operating hours are accumulated. This can be done either by operating one
system for a long time, so that many failure and repair cycles are obtained and
recorded, or by operating a large number of identical systems (an ensemble)
for a shorter period of time and combining the data. If the ratio of cumulative
operating time to total test time is computed, it approaches $A(\infty)$ as $t \to \infty$.
Actually the data taken during the transient period of availability should be
discarded to avoid any distortions. In fact if one wished to compute the
transient phase of availability from experimental data, one would be forced to use
a very large number of systems over a short period of time. In analyzing the
data one would break up the time scale into many small intervals and compute
the ratio of cumulative operating time over the intervals divided by the length
of the interval. [See Eq. (3.80).]

Figure B21 Markov reliability model for two identical parallel elements and one
repairer, where $\lambda' = 2\lambda$ for an ordinary system and $\lambda' = \lambda$ for a standby system.

In a two-element nonseries system, the reliability function as well as the 
availability function is influenced by system repairs. The Markov reliability 
and availability graphs for systems with two components are given in Figs. 
B21, B22, and B23, and their solution is discussed in Shooman [1990, p. 345]. 


B7.5 Computation of Steady-State Availability 

Figure B22 Markov reliability model for two identical parallel elements and k
repairers, where $\lambda' = 2\lambda$ for an ordinary system, $\lambda' = \lambda$ for a standby system,
$\mu' = \mu$ for one repairer, and $\mu' = k\mu$ for more than one repairer (k > 1).

Figure B23 Markov availability graph for two identical parallel elements and k
repairers, where $\lambda' = 2\lambda$ for an ordinary system, $\lambda' = \lambda$ for a standby system,
$\mu' = \mu$ for one repairer, $\mu' = k_1\mu$ for more than one repairer ($k_1 > 1$),
$\mu'' = \mu$ for one repairer, $\mu'' = 2\mu$ for two repairers, and $\mu'' = k_2\mu$ for more than two
repairers ($k_2 > 1$).

When only the steady-state availability is of interest, a simplified
computational procedure can be used. In the steady state, all the state probabilities
should approach a constant; therefore, setting the derivatives to zero yields the
following:

$$\dot{P}_{s_0}(t) = \dot{P}_{s_1}(t) = \dot{P}_{s_2}(t) = 0$$

This set of equations cannot be solved for the steady-state probabilities, since
their determinant is zero. Any two of these equations can, however, be
combined with the identity

$$P_{s_0}(\infty) + P_{s_1}(\infty) + P_{s_2}(\infty) = 1 \tag{B96}$$

to yield a solution. A simpler method for computing steady-state availability
using Laplace transforms is discussed in Chapter 4.
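The procedure of combining two of the steady-state equations with the normalization identity can be sketched as a small linear solve. The transition rates below follow one reading of the availability graph for two identical parallel elements (state 0 to state 1 at rate $\lambda'$, state 1 to state 2 at rate $\lambda$, state 1 back to state 0 at rate $\mu'$, and state 2 back to state 1 at rate $\mu''$); that reading, the numerical rates, and the function names are all assumptions for illustration.

```python
def solve_linear(a: list[list[float]], b: list[float]) -> list[float]:
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def steady_state(lam_p: float, lam: float, mu_p: float, mu_pp: float) -> list[float]:
    """Steady-state probabilities [P_s0, P_s1, P_s2] for the assumed three-state graph."""
    a = [[-lam_p, mu_p, 0.0],   # dP_s0/dt = 0
         [0.0, lam, -mu_pp],    # dP_s2/dt = 0
         [1.0, 1.0, 1.0]]       # probabilities sum to unity
    return solve_linear(a, [0.0, 0.0, 1.0])

# Hypothetical ordinary (non-standby) system with one repairer.
lam, mu = 1e-3, 1e-1
p = steady_state(lam_p=2 * lam, lam=lam, mu_p=mu, mu_pp=mu)
a_ss = p[0] + p[1]   # parallel system is up unless both elements are down
print(p, a_ss)
```

The third state equation is omitted because it is a linear combination of the other two; replacing it with the normalization identity is what makes the system solvable.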

So far we have discussed reliability and availability computations only in 
one- and two-element systems. Obviously we could set up and solve Markov 
models for a larger number of elements, but the complexity of the problem 
increases very rapidly as n, the number of elements, increases. (If the elements
are distinct, it goes as $2^n$, and if identical, as n + 1.)


B8 LAPLACE TRANSFORM SOLUTIONS OF MARKOV MODELS

The formulation of a Markov model always leads to a set of first-order dif¬ 
ferential equations. For simple models, these equations are easily solved by 
using conventional differential equation theory. As we add more components, 
however, the model becomes more complex, and when repair is added to the 
model, the equations become coupled, making the solution more difficult. The 
easiest approach to the solution of such equations is through the use of Laplace 
transforms. In addition, the Laplace transform method provides a particularly
simple method of calculating the mean time to failure (MTTF), the steady-state
availability ($A_{ss}$), and the initial behavior of the reliability or availability
functions: $R(t \to 0)$ or $A(t \to 0)$.

B8.1 Laplace Transforms 

One can appreciate the simplification that the Laplace transform provides in
the solution of differential equations by analogy to the use of logarithms
for simplifying certain numerical computations. In the predigital computer
era, accurate computations to many decimal places for expressions such as
$\alpha = (A \times B)/(C \times D)$ depended on lengthy hand computations, cumbersome
mechanical calculators, or logarithms. Using logarithms, $\log(\alpha) =
\log(A) + \log(B) - \log(C) - \log(D)$. Thus multiplication and division are reduced
to addition and subtraction of logarithms and $\alpha$ is determined by taking the
antilogarithm of $\log(\alpha)$. Of course, for high accuracy, log tables with many
digits of precision are required; such tables were calculated by mathematicians
during the Depression of the 1930s as part of Franklin D. Roosevelt's
New Deal programs used for creating jobs. Logarithm tables up to 10 or 16 digits
appear in Abramowitz [1972, pp. 95-113]. The concept is to use logarithms
to convert multiplication and division to simpler addition and subtraction and
recover the answer by taking antilogarithms. The analogous procedure is to
use Laplace transforms to convert differential equations to algebraic equations,
solve the simpler algebraic equations, and use inverse Laplace transforms to
recover the answer. The Laplace transform will now be introduced as an aid in
the solution of differential equations. The Laplace transform of a time function
f(t) is defined by the integral 18

$$L\{f(t)\} = F(s) = \{f(t)\}^* = f^*(s) = \int_0^\infty f(t)e^{-st}\,dt \tag{B97}$$

Four equivalent sets of notation for the Laplace transform are given in Eq.
(B97). The first two are the most common, but they will not always be used
since the symbol F(s) causes confusion when we take the Laplace transform
of both a density and a distribution function in the same equation. The third
and fourth notation will be used whenever confusion might arise. The asterisk
or the change in argument from t to s (or both) symbolizes the change from
the time domain to the transform domain. The utility of the Laplace transform
is that it reduces ordinary constant-coefficient linear differential equations to
algebraic equations in s which are easily solved. The solution in terms of s
is then converted back to a time function by an inverse-transform procedure.
Sometimes the notation $L^{-1}$ is used to denote the inverse-transform procedure;
thus one could write $L^{-1}\{F(s)\} = f(t)$. A pictorial representation of the Laplace
transform solution of a differential equation is given in Fig. B24.


18 The Laplace transform is defined only over the range of s for which this integral exists.

Figure B24 The solution of a differential equation by Laplace transform techniques. (The differential equation in the time domain is carried into the transform domain, solved there by algebraic manipulation, and the result is carried back to the time domain.)


Since only a few basic transform techniques will be needed in this book, 
this discussion will be brief and will not touch on the broad aspects of the 
method. The Laplace transforms of three important functions follow. 

Example 1: For the exponential function $f(t) = e^{-at}$

$$L\{f(t)\} = f^*(s) = \int_0^\infty e^{-at}e^{-st}\,dt = \int_0^\infty e^{-(a+s)t}\,dt = \left.\frac{-e^{-(a+s)t}}{s + a}\right|_0^\infty = \frac{1}{s + a} \quad \text{for } s > -a \tag{B98}$$

The restriction $s > -a$ is necessary in order that the integral not diverge.


Example 2: Similarly, for the cosine function

$$f(t) = \cos at = \frac{e^{iat} + e^{-iat}}{2}$$

$$L\{f(t)\} = \left\{\frac{e^{iat}}{2}\right\}^* + \left\{\frac{e^{-iat}}{2}\right\}^*$$

$$f^*(s) = \frac{1}{2}\left(\frac{1}{s + ia} + \frac{1}{s - ia}\right) = \frac{s}{s^2 + a^2} \quad \text{for } s > 0 \tag{B99}$$


Note that in the above computation two properties were used. The Laplace 
transform is an integral, and the integral of a sum of two time functions is the 
sum of the integrals; thus, the transform of the sum is the sum of transforms. 
This is referred to as the superposition property. Also, the result of Eq. (B98) 
was used for each term in Eq. (B99). 


Example 3: As a third example we consider the unit step function $u_{-1}(t)$ and
the constant unity. When $f(t) = 1$ or $f(t) = u_{-1}(t)$

$$f^*(s) = \int_0^\infty 1 \cdot e^{-st}\,dt = \left.\frac{-e^{-st}}{s}\right|_0^\infty = \frac{1}{s} \quad \text{for } s > 0 \tag{B100}$$

Note that although a step function and a constant are different functions, their
Laplace transforms are the same, since over the region $0 < t < +\infty$ they
have the same value. Thus, the Laplace transform holds only for positive t.
We can view a step as the limit of an exponential as we increase the time
constant, $1/a \to \infty$. Equation (B100) could therefore be obtained from Eq.
(B98) by letting $a \to 0$. The transforms for several time functions of interest
are given in Table B6.
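The three transforms derived above can be spot-checked numerically by evaluating the defining integral with a simple trapezoidal rule over a truncated range. This is only a rough sketch: the truncation point, step count, and test values of s and a are arbitrary assumptions, and the function name `laplace_numeric` is hypothetical.

```python
import math

def laplace_numeric(f, s: float, t_max: float = 40.0, steps: int = 40_000) -> float:
    """Approximate the integral of f(t)*exp(-s*t) from 0 to t_max (trapezoidal rule)."""
    h = t_max / steps
    total = 0.5 * (f(0.0) + f(t_max) * math.exp(-s * t_max))
    for i in range(1, steps):
        t = i * h
        total += f(t) * math.exp(-s * t)
    return total * h

s = 1.5
exp_num = laplace_numeric(lambda t: math.exp(-0.5 * t), s)   # expect 1/(s + a), a = 0.5
cos_num = laplace_numeric(lambda t: math.cos(2.0 * t), s)    # expect s/(s^2 + a^2), a = 2
step_num = laplace_numeric(lambda t: 1.0, s)                 # expect 1/s

print(exp_num, 1 / (s + 0.5))
print(cos_num, s / (s**2 + 4.0))
print(step_num, 1 / s)
```

Truncating the upper limit is harmless here because the integrand decays like $e^{-st}$; for values of s near the edge of the region of convergence a much longer range would be needed.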

TABLE B6 A Short Table of Laplace Transforms

No.   f(t)                                    $\{f(t)\}^* = f^*(s) = L\{f(t)\} = F(s)$
1     $u_0(t)$                                $1$
2     $u_{-1}(t)$                             $\frac{1}{s}$
3     $u_{-2}(t)$                             $\frac{1}{s^2}$
4     $e^{-at}$                               $\frac{1}{s + a}$
5     $\frac{1}{(n-1)!}t^{n-1}e^{-at}$        $\frac{1}{(s + a)^n}$
6     $\sin at$                               $\frac{a}{s^2 + a^2}$
7     $\cos at$                               $\frac{s}{s^2 + a^2}$
8     $e^{-bt}\sin at$                        $\frac{a}{(s + b)^2 + a^2}$
9     $e^{-bt}\cos at$                        $\frac{s + b}{(s + b)^2 + a^2}$
10    $Ae^{-at} + Be^{-bt}$                   $\frac{(A + B)s + Ab + Ba}{(s + a)(s + b)}$

Note: The functions $u_0(t), u_{-1}(t), \ldots, u_n(t)$ are called the singularity
functions, where $u_n(t)$ is the derivative of $u_{n-1}(t)$. The unit step, $u_{-1}(t)$,
was already defined as 0 for $-\infty < t < 0$ and 1 for $0 < t < +\infty$. The unit
ramp, $u_{-2}(t)$, is the integral of the step function (of course, the step is the
derivative of the ramp). The function $u_0(t)$ is the unit-impulse function in
which the amplitude is the derivative of the step function and is 0 everywhere
except t = 0, where it is infinite. The area of the impulse is unity
at t = 0, as it must be since the step is the integral of the impulse.

In order to solve differential equations with Laplace transform techniques
we must compute the transform of a derivative. This can be done directly from
Eq. (B97) using integration by parts

$$\left\{\frac{df(t)}{dt}\right\}^* = \int_0^\infty \frac{df(t)}{dt}e^{-st}\,dt$$

Letting $dv = df(t)$ and $u = e^{-st}$ and integrating by parts,

$$\left\{\frac{df(t)}{dt}\right\}^* = \left. e^{-st}f(t)\right|_0^\infty + s\int_0^\infty f(t)e^{-st}\,dt$$

We first discuss the evaluation of the $e^{-st}f(t)$ term at its upper and lower limits.
Since the Laplace transform is defined only for functions f(t) which build up
more slowly than $e^{-st}$ decays [see footnote to Eq. (B97)], $\lim_{t \to \infty} e^{-st}f(t) = 0$. At
the lower limit we obtain the initial value of the function f(0). 19 The integral
is of course the Laplace transform itself

$$\left\{\frac{df(t)}{dt}\right\}^* = sf^*(s) - f(0) \tag{B101}$$


By letting $g(t) = d^n f(t)/dt^n$ it is easy to generate a recursion relationship

$$\left\{\frac{dg(t)}{dt}\right\}^* = \left\{\frac{d^{n+1} f(t)}{dt^{n+1}}\right\}^* = s\left\{\frac{d^n f(t)}{dt^n}\right\}^* - f^{(n)}(0) \tag{B102}$$

and for the second derivative

$$\left\{\frac{d^2 f(t)}{dt^2}\right\}^* = s^2 f^*(s) - s f(0) - \dot{f}(0) \tag{B103}$$

19 The notation $f(0)$ means the value of the function at t = 0. If singularity functions occur at t =
0, we must use care and write $f(0^-)$, which is the limit as 0 is approached from the left.


Using the information discussed, we can solve the homogeneous differential
equation

$$\frac{d^2 y}{dt^2} + 5\frac{dy}{dt} + 6y = 0, \qquad y(0) = 0, \quad \dot{y}(0) = 1$$

Taking the transform of each term, we have

$$[s^2 y^*(s) - s y(0) - \dot{y}(0)] + 5[s y^*(s) - y(0)] + 6 y^*(s) = 0$$
$$[s^2 y^*(s) - 1] + 5[s y^*(s)] + 6 y^*(s) = 0$$
$$(s^2 + 5s + 6)\, y^*(s) = 1$$
$$y^*(s) = \frac{1}{(s + 2)(s + 3)}$$

Using transform 10 in Table B6 with a = 2 and b = 3,

$$A + B = 0 \qquad 3A + 2B = 1$$
$$A = +1 \qquad B = -1$$
$$y(t) = e^{-2t} - e^{-3t}$$

Suppose we add the driving function $e^{-4t}$ to the above example, that is,

$$\left\{\frac{d^2 y}{dt^2} + 5\frac{dy}{dt} + 6y\right\}^* = \left\{e^{-4t}\right\}^*$$
$$(s + 2)(s + 3)\, y^*(s) - 1 = \frac{1}{s + 4}$$
$$y^*(s) = \frac{s + 5}{(s + 4)(s + 2)(s + 3)}$$

No transform for this function exists in the table, but we can use partial-fraction
algebra to reduce this to known results

$$\frac{s + 5}{(s + 4)(s + 2)(s + 3)} = \frac{1/2}{s + 4} + \frac{3/2}{s + 2} + \frac{-2}{s + 3}$$

Thus, each term represents an exponential, and

$$y(t) = \tfrac{1}{2} e^{-4t} + \tfrac{3}{2} e^{-2t} - 2e^{-3t}$$

The $e^{-4t}$ term represents the particular solution (driving function), and the $e^{-2t}$
and $e^{-3t}$ terms represent the homogeneous solution (natural response).

The partial-fraction-expansion coefficients can be found by conventional
means or by the following shortcut formula

$$\frac{N(s)}{D(s)} = \frac{N(s)}{\prod_{i=1}^{n}(s + r_i)} = \frac{A_1}{s + r_1} + \frac{A_2}{s + r_2} + \cdots + \frac{A_n}{s + r_n}$$

where

$$A_i = \left[\frac{N(s)}{D(s)}\,(s + r_i)\right]_{s = -r_i} \tag{B104}$$

For the above example

$$A_1 = \left[\frac{s + 5}{(s + 4)(s + 2)(s + 3)}\,(s + 4)\right]_{s = -4} = \frac{1}{2}$$
$$A_2 = \left[\frac{s + 5}{(s + 4)(s + 2)(s + 3)}\,(s + 2)\right]_{s = -2} = \frac{3}{2}$$
$$A_3 = \left[\frac{s + 5}{(s + 4)(s + 2)(s + 3)}\,(s + 3)\right]_{s = -3} = -2$$


The derivation of Eq. (B104) as well as a similar one for the case of repeated 
roots can be found in any text on Laplace transforms. 
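
The shortcut formula of Eq. (B104) is easy to mechanize for the distinct-root case. The following is a minimal sketch, not a general partial-fraction routine; the function names `poly_eval` and `cover_up` are illustrative choices, and exact rational arithmetic is used only to keep the results free of rounding.

```python
from fractions import Fraction

def poly_eval(coeffs, s):
    """Evaluate a polynomial given in descending powers of s (Horner's rule)."""
    result = Fraction(0)
    for c in coeffs:
        result = result * s + c
    return result

def cover_up(numerator, roots):
    """Return the A_i of N(s) / prod(s + r_i), assuming the r_i are distinct."""
    coefficients = []
    for i, r_i in enumerate(roots):
        s = Fraction(-r_i)                  # evaluate at s = -r_i
        denom = Fraction(1)
        for j, r_j in enumerate(roots):
            if j != i:                      # the (s + r_i) factor has been cancelled
                denom *= s + r_j
        coefficients.append(poly_eval(numerator, s) / denom)
    return coefficients

# N(s) = s + 5 and D(s) = (s + 4)(s + 2)(s + 3), as in the driven example
print(cover_up([1, 5], [4, 2, 3]))
```

Repeated roots are deliberately not handled; for that case the denominator product above would evaluate to zero.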

We have already discussed two Laplace transform theorems, superposition 
and derivative property. Some additional ones useful in solving Markov models 
appear in Table B7. 

The first and second theorems have already been discussed. The third theorem
is simply the integral equivalent of the differentiation theorems. The convolution
theorem is important since it describes the time-domain equivalent of
a product of two Laplace transforms. Theorems 6 and 7 are useful in computing
the initial and final behavior of reliability functions.

TABLE B7 A Short Table of Laplace Transform Theorems

No.  Operation                    f(t)                                          L{f(t)} = F(s)
1    Linearity (superposition)    $a_1 f_1(t) + a_2 f_2(t)$                     $a_1 F_1(s) + a_2 F_2(s)$
     property
2    Differentiation theorems     $df(t)/dt$                                    $sF(s) - f(0)$
                                  $d^2 f(t)/dt^2$                               $s^2 F(s) - s f(0) - \dot{f}(0)$
                                  $d^n f(t)/dt^n$                               $s L\{d^{n-1} f(t)/dt^{n-1}\} - f^{(n-1)}(0)$
3    Integral theorems            $\int_0^t f(t)\,dt$                           $F(s)/s$
                                  $\int_{-\infty}^t f(t)\,dt$                   $F(s)/s + (1/s)\int_{-\infty}^0 f(t)\,dt$
4    Convolution theorem          $\int_0^t f_1(\tau) f_2(t - \tau)\,d\tau$     $F_1(s) F_2(s)$
5    Multiplication-by-t          $t f(t)$                                      $-dF(s)/ds$
     property
6    Initial-value theorem        $\lim_{t \to 0} f(t)$                         $\lim_{s \to \infty} sF(s)$
7    Final-value theorem          $\lim_{t \to \infty} f(t)$                    $\lim_{s \to 0} sF(s)$

Note: The function sF(s) is a ratio of polynomials in problems we shall consider. The roots of the
denominator polynomial are called poles. We cannot apply the initial- and final-value theorems
if any pole of sF(s) has a zero or positive real part. The statement is conventionally worded: the
initial- and final-value theorems hold only provided that all poles of sF(s) lie in the left half of
the s plane.


B8.2 MTTF from Laplace Transforms 

The MTTF can also be computed from the Laplace transform of R(t). [See Eq. 
(B51).] 

Another form in terms of $R^*(s)$ (an alternate notation for $L\{R(t)\}$) is
obtained by considering $\int_0^t R(\tau)\,d\tau$. Using Theorem 3 of Table B7, we obtain

$$L\left\{\int_0^t R(\tau)\,d\tau\right\} = \frac{R^*(s)}{s} \tag{B105}$$

However,

$$\text{MTTF} = \lim_{t \to \infty} \int_0^t R(\tau)\,d\tau$$

Using Theorem 7 of Table B7,

$$\text{MTTF} = \lim_{t \to \infty} \int_0^t R(\tau)\,d\tau = \lim_{s \to 0} s\,L\left\{\int_0^t R(\tau)\,d\tau\right\} = \lim_{s \to 0} s\,\frac{R^*(s)}{s}$$

Thus,

$$\text{MTTF} = \lim_{s \to 0} R^*(s) \tag{B106}$$

The above formula is extremely useful for computing the MTTF for a Markov
model.
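
As a numerical illustration of the result that the MTTF equals the transform of R(t) evaluated at s = 0, the sketch below compares that limit with a direct integration of R(t). The reliability function used, two identical parallel units without repair, and the failure rate value are assumptions chosen for illustration only; the function names are likewise hypothetical.

```python
import math

def r_time(t, lam):
    """R(t) = 2e^{-lam t} - e^{-2 lam t}: two parallel units, no repair (assumed example)."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)

def r_star(s, lam):
    """Laplace transform of r_time, term by term from entry 4 of Table B6."""
    return 2.0 / (s + lam) - 1.0 / (s + 2.0 * lam)

def mttf_from_transform(lam):
    """MTTF as the s -> 0 limit of R*(s); here the limit is just the value at s = 0."""
    return r_star(0.0, lam)

def mttf_by_integration(lam, steps=200_000):
    """Trapezoidal estimate of the integral of R(t) from 0 to 50/lam."""
    t_max = 50.0 / lam
    h = t_max / steps
    total = 0.5 * (r_time(0.0, lam) + r_time(t_max, lam))
    for k in range(1, steps):
        total += r_time(k * h, lam)
    return total * h

lam = 0.001  # assumed failure rate, failures per hour
print(mttf_from_transform(lam), mttf_by_integration(lam))
```

Both values should agree closely with 3/(2λ), which is the closed-form MTTF for this assumed system.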


B8.3 Time-Series Approximations from Laplace Transforms 

One of the objectives of the MTTF computations of the previous section is to
simplify the algebra involved in obtaining time functions from the transformed
function. In most practical cases, the transform expression (ratio of two
polynomials in s) has a denominator that is of second, third, or higher order. The
solution requires the factoring of the polynomial (generally requiring numerical
methods) and subsequent partial-fraction expansion. Calculating the MTTF
from F(s) by taking the limit as given in Eq. (B106) is simple; however, it
provides only partial information. In Section 3.4.1, we discussed approximating
the system-time function in the high-reliability region by the leading terms in
the Taylor-series expansion. If this is our objective, and if we have the Laplace
transform, we can find the Taylor-series coefficients simply and directly without
first finding the time function.

Any function $f(t)$, whose various derivatives exist, can be expanded in a
Taylor series:

$$f(t) = f(0) + f'(0)\,t/1! + f''(0)\,t^2/2! + f'''(0)\,t^3/3! + \cdots \tag{B107}$$

where $f'''(0)$ is the third time derivative of $f(t)$ evaluated at t = 0, and similarly
for $f(0)$, $f'(0)$, and so on.

Note that the derivatives of $f(t)$ always exist for a reliability function that
is a linear combination of exponential terms, since all derivatives exist for an
exponential function. We can rewrite this equation in terms of a set of constants
$K_i$, which stand for the time derivatives, and obtain

$$f(t) = K_0 + K_1 t/1! + K_2 t^2/2! + \cdots = \sum_{n=0}^{\infty} K_n t^n/n! \tag{B108}$$

If we take $L\{f(t)\}$ for the function in Eq. (B108), we obtain a series of simple
Laplace transforms for the right-hand side of the equation. These transforms
can easily be obtained from entry no. 5 in Table B6 by setting a = 0, yielding

$$L\{f(t)\} = F(s) = \sum_{n=0}^{\infty} \frac{K_n}{s^{n+1}} \tag{B109}$$

Knowing that $f(t)$ in Eq. (B108) is a reliability function, we know that R(0) =
1; thus $K_0 = 1$. We can easily manipulate F(s) into the series form given by Eq.
(B109) by the simple process of long division. Of course, this method presupposes
that we have the transform, which is generally true if we are solving a
Markov model. If we already have the time function, it is probably easier to use
the expansions discussed in Section 3.4.1 than to first compute the transform
and use this approach. The following example illustrates the method.

Let us suppose that in the process of solving a Markov model we obtain
the following Laplace transform of a reliability function:$^{20}$

$$R^*(s) = \frac{s + \lambda + \lambda' + \mu'}{s^2 + [\lambda + \lambda' + \mu']s + \lambda\lambda'} \tag{B110}$$

Performing long division of the numerator and denominator polynomials, we
obtain

$$R^*(s) = \frac{1}{s} - \frac{\lambda\lambda'}{s^3} + \frac{\lambda\lambda'[\lambda + \lambda' + \mu']}{s^4} - \cdots \tag{B111}$$

Thus, by using Table B6, entry no. 5 to obtain the inverse transform, that is,
the expression for R(t) that corresponds to $R^*(s)$ given in Eq. (B111),

$$R(t) \approx 1 - \frac{\lambda\lambda' t^2}{2} + \frac{\lambda\lambda'[\lambda + \lambda' + \mu']\,t^3}{6} \tag{B112}$$
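
The long-division step can be sketched in a few lines. The function below expands a ratio of polynomials in s as a series in 1/s, which is the form needed for reading off the Taylor coefficients. The name `series_in_inverse_s` and the numerical rates in the usage lines are hypothetical and serve only to make the example concrete.

```python
from fractions import Fraction

def series_in_inverse_s(num, den, terms):
    """Coefficients c_1, c_2, ... of F(s) = c_1/s + c_2/s^2 + ...

    num and den list polynomial coefficients in descending powers of s,
    and deg(num) < deg(den) is assumed.
    """
    shift = len(den) - len(num)            # power of 1/s carried by the leading term
    p = [Fraction(0)] * shift + [Fraction(c) for c in num]
    d = [Fraction(c) for c in den]
    q = []
    for k in range(terms + 1):
        acc = p[k] if k < len(p) else Fraction(0)
        for j in range(1, min(k, len(d) - 1) + 1):
            acc -= d[j] * q[k - j]         # remove what earlier quotient terms already explain
        q.append(acc / d[0])
    return q[1:]                           # q[0] multiplies 1/s^0 and is zero here

# Assumed rates for illustration: lambda = 1, lambda' = 2, mu' = 10
lam, lam_p, mu_p = 1, 2, 10
a, b = lam + lam_p + mu_p, lam * lam_p
print(series_in_inverse_s([1, a], [1, a, b], 4))
```

The coefficient of 1/s^(n+1) is the Taylor coefficient of t^n/n!, so the four values returned correspond to the constant, t, t^2/2!, and t^3/3! terms of R(t).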


For a parallel system, $\lambda' = 2\lambda$ and $\mu' = \mu$, and substitution in Eq. (B112) yields

$$R(t) \approx 1 - \lambda^2 t^2 + \frac{\lambda^2[3\lambda + \mu]\,t^3}{3} \tag{B113}$$

For a standby system, $\lambda' = \lambda$ and $\mu' = \mu$, and substitution in Eq. (B112) yields

$$R(t) \approx 1 - \frac{\lambda^2 t^2}{2} + \frac{\lambda^2[2\lambda + \mu]\,t^3}{6} \tag{B114}$$

Comparing Eq. (B113) with (B114), we see from the coefficients of the $t^2$ term
that the standby system is superior.

20 This example is actually the Laplace transform of the reliability of two parallel elements with
repair. For hot standby, $\lambda' = 2\lambda$, and for cold standby, $\lambda' = \lambda$. See Eqs. (3.65a, b).

PROBLEMS

Note: Problems B1-B4, B6, and B10-B12 are taken from Shooman [1990]. 

B1. A series system is composed of n identical independent components. The
component probability of success is $p_c$ and $q_c = 1 - p_c$.

(a) Show that if $q_c \ll 1$, the system reliability R is approximately given
by $R \approx 1 - nq_c$.

(b) If the system has 10 components and R must be 0.99, how good
must the components be?

B2. A parallel system is composed of 10 identical independent components.
If the system reliability R must be 0.99, what is the minimum component
reliability?

B3. A 10-element system is constructed of independent identical components
so that 5 out of the 10 elements are necessary for system success. If the
system reliability R must be 0.99, how good must the components be?

B4. Draw reliability graphs for the following three reliability block diagrams.
Note: The probabilities of system success for independent identical units
are given for each part, (a), (b), and (c), of Fig. P1.

(a) $P_s = 2p^2 + 2p^3 - 5p^4 + 2p^5$
(b) $P_s = 4p^2 - 2p^3 - 4p^4 + 4p^5 - p^6$
(c) $P_s = 2p + p^2 - 2p^3 - 7p^4 + 14p^5 - 9p^6 + 2p^7$

Figure P1

B5. Formulate a fault-tree model for the systems given in problem B4.

B6. Find all the minimal tie sets and cut sets for the three systems in problem
B4.

B7. Solve problems B2 and B3.

B8. Assume numerical values for the axes of Fig. B6, and explain the cost
trade-offs of burn-in and replacement.

B9. Check the MTTF computation given in Eq. (B54).

B10. A communication system is composed of a fixed-frequency transmitter
$T_1$ and a fixed-frequency receiver $R_1$. The fixed frequency is $f_1$ and both
receiver and transmitter have constant hazards $\lambda$. In order to improve the
reliability, a second receiver and transmitter operating on frequency $f_2$
are used to provide a redundant channel. Both channels are identical,
except for frequency. Construct a reliability diagram for the system and
write the reliability function. In order to improve reliability, a tuning unit
is added to each receiver so that it can operate at frequency $f_1$ or $f_2$.
The hazard for each tuning unit is given by $\lambda'$. Draw the new reliability
diagram and write the reliability function. Sketch the reliability of the
improved system and the original two-channel system. Assume that $\lambda' =
0.1\lambda$ and repeat for $\lambda' = 10\lambda$. (Use series approximations, if necessary.)

B11. Solve for the reliability expression for a three-element standby system
using a Markov model. All elements are independent identical units with
constant-hazard $\lambda$.

B12. For a single component with repair, R(t) and A(t) are given by Eqs. (B91)
and (B94). If we specify that $R(t_1) = 0.9$,

(a) What can you say about $A(t_1)$?

(b) How are $\lambda$ and $\mu$ constrained if $A(t_1) > 0.99$?



APPENDIX C 


REVIEW OF ARCHITECTURE 
FUNDAMENTALS 


C1 INTRODUCTION TO COMPUTER ARCHITECTURE

Most readers of this book probably have an electrical engineering or computer
science background and are familiar with the material presented in this
appendix; thus they can skip it altogether or thumb through it as a refresher.
However, some readers may have a background in mathematics, operations
research, or some similar field; for them, this appendix will serve as a concise
background. The reader is referred to the following references for more detailed
information: Hill, 1981; Kohavi, 1978; Mano, 1995; Roth, 1995; Shiva, 1988;
Wakerly, 2001.

C1.1 Number Systems

Computers are constructed from switching elements that are two-state devices;
thus it is common to utilize the binary number system (base 2) for computer
computation, design of arithmetic algorithms, and construction of computer
hardware. A number, N, written in radix (base), r, takes on the general
polynomial form:

$$N = \underbrace{a_n r^n + a_{n-1} r^{n-1} + \cdots + a_1 r^1 + a_0 r^0}_{\text{whole number portion}} \;.\; \underbrace{a_{-1} r^{-1} + a_{-2} r^{-2} + \cdots}_{\text{fraction portion}} \tag{C1}$$

The period between the two portions marks the radix point.

Each number system has r distinct digits; for example, in binary, the two digits
0 and 1; in decimal, the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, and 9.

One can convert from one number system to another using these basic
definitions. As an example, consider the conversion of a base-2 number to a
base-10 number:

$$(11110101)_2 = 2^7 + 2^6 + 2^5 + 2^4 + 2^2 + 2^0$$
$$= 128 + 64 + 32 + 16 + 4 + 1 = (245)_{10}$$
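
The polynomial form can be evaluated directly for the whole-number portion of a number in any radix. The following is a small sketch; `to_decimal` is an illustrative name, and the letters A to Z are assumed to serve as the digit symbols above 9.

```python
def to_decimal(digits, radix):
    """Evaluate the whole-number polynomial: the sum of a_i * r**i over all digits."""
    value = 0
    for ch in digits:
        a = int(ch, 36)                 # accepts 0-9 and A-Z as digit symbols
        if a >= radix:
            raise ValueError(f"digit {ch!r} is not valid in base {radix}")
        value = value * radix + a       # Horner form of the polynomial
    return value

print(to_decimal("11110101", 2))
```

The same routine handles octal and hexadecimal strings, since only the radix and the permitted digit symbols change.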


Note that parentheses and base subscripts are commonly used to clarify the
notation when one is discussing two or more number systems and conversions.
Similarly, one can convert from base 10 to base 2 by extracting the largest
powers of 2 that are contained in the base-10 number. Conversion of $(245)_{10}$
to $(?)_2$ proceeds as follows:

Remaining value   Power of 2 subtracted   Result   Contained?
245               -128                    117      Yes
117               -64                     53       Yes
53                -32                     21       Yes
21                -16                     5        Yes
5                 -8                      X        No
5                 -4                      1        Yes
1                 -2                      X        No
1                 -1                      0        Yes

Thus the subtraction process shows that 245 base 10 contains 2 to the powers
7, 6, 5, 4, 2, and 0 (the Yes’s), but not $2^3 = 8$ or $2^1 = 2$ (the No’s), which yields
the binary number 11110101. The references give many simpler algorithms for
conversion.
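
The subtraction procedure translates almost line for line into code. This is a sketch of that specific procedure, with an illustrative function name, rather than the shorter repeated-division method found in the references.

```python
def to_binary_by_subtraction(n):
    """Convert a non-negative integer to a binary string by removing powers of 2."""
    if n == 0:
        return "0"
    power = 1
    while power * 2 <= n:           # find the largest power of 2 not exceeding n
        power *= 2
    bits = []
    remainder = n
    while power >= 1:
        if remainder >= power:      # "Yes": this power of 2 is contained
            bits.append("1")
            remainder -= power
        else:                       # "No": this power of 2 is skipped
            bits.append("0")
        power //= 2
    return "".join(bits)

print(to_binary_by_subtraction(245))
```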

The first twenty numbers in the decimal, binary, octal (base 8), hexadecimal
(base 16; commonly called Hex), and the binary-coded decimal (BCD) systems
are given in Table C1. Since the Hex number system is base 16, we need
sixteen digit symbols. Clearly, the first ten are the digits 0-9 and the remaining
six are generally represented by the first six letters of the English alphabet: A,
B, ..., F.

Note from Table C1 that it is easy to convert from binary to octal. One
divides the binary number into groups of three digits and writes the octal
numbers (0 to 7) that correspond to the group of three digits. Similarly, one can
convert from binary to Hex by grouping four digits at a time and using a
similar process. Reverse conversions involve expanding each octal digit into three
binary digits or each Hex digit into four binary digits. The BCD number
system uses four binary digits to represent each of the decimal digits from 0 to 9. This
is emphasized in Table C1 by the vertical bar used for separating the binary
digits into groups of four. The advantage of the BCD system is that each
decimal digit can be converted by repetition of the same circuit; thus designs based
on BCD numbers are highly modular.

TABLE C1 Number Systems Commonly Used in Computer and Digital Circuit
Design

Decimal   Binary   Octal   Hex   BCD
0         00000    00      00    0000|0000
1         00001    01      01    0000|0001
2         00010    02      02    0000|0010
3         00011    03      03    0000|0011
4         00100    04      04    0000|0100
5         00101    05      05    0000|0101
6         00110    06      06    0000|0110
7         00111    07      07    0000|0111
8         01000    10      08    0000|1000
9         01001    11      09    0000|1001
10        01010    12      0A    0001|0000
11        01011    13      0B    0001|0001
12        01100    14      0C    0001|0010
13        01101    15      0D    0001|0011
14        01110    16      0E    0001|0100
15        01111    17      0F    0001|0101
16        10000    20      10    0001|0110
17        10001    21      11    0001|0111
18        10010    22      12    0001|1000
19        10011    23      13    0001|1001
20        10100    24      14    0010|0000
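
The grouping rule for binary-to-octal and binary-to-Hex conversion, and the digit-by-digit nature of BCD, can be sketched as follows. The helper names are illustrative, and the vertical bar in the BCD output simply mirrors the separator used in the table.

```python
def group_bits(bits, size):
    """Split a bit string into groups of `size`, padding with zeros on the left."""
    padded = bits.zfill(-(-len(bits) // size) * size)   # round length up to a multiple of size
    return [padded[i:i + size] for i in range(0, len(padded), size)]

def binary_to_octal(bits):
    """Write one octal digit for each group of three bits."""
    return "".join(str(int(g, 2)) for g in group_bits(bits, 3))

def binary_to_hex(bits):
    """Write one Hex digit for each group of four bits."""
    return "".join("0123456789ABCDEF"[int(g, 2)] for g in group_bits(bits, 4))

def decimal_to_bcd(n):
    """Encode each decimal digit of n as its own four-bit group."""
    return "|".join(format(int(d), "04b") for d in str(n))

print(binary_to_octal("11110101"), binary_to_hex("11110101"), decimal_to_bcd(245))
```

Note that the BCD encoding works on the decimal digits one at a time, which is the modularity property mentioned in the text: the same four-bit conversion is repeated for every digit.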


C1.2 Arithmetic in Binary 

One can discuss at length arithmetic in various bases; however, the algorithms 
can become quite detailed, especially when one considers both positive and 
negative numbers. In the binary number system, the rules for positive numbers 
are quite simple. 

1. The sum of any two binary digits (0 + 0, 0 + 1, 1+0, 1 + 1) is 0 if the 
digits are the same (0 + 0, 1 + 1) and 1 if the digits differ (0 + 1, 1 + 
0). A carry to the next digit is only generated when both digits are 1. 

2. The difference of any two binary digits (0 - 0, 0 - 1, 1-0, 1 - 1) is 0 
if the digits are the same (0 - 0, 1 - 1) and 1 if the digits differ (0 - 1, 
1 - 0). A borrow to the next digit is only generated in the case 0-1. 

3. To multiply two binary numbers, we treat the process just like decimal 
multiplication, forming partial products that are shifted left once each 
time we shift to another bit of the multiplier (number on the bottom). 
The partial products are either 0 or a replica of the multiplicand (number 
on the top); then they are added using binary addition. 

4. Long division of two binary numbers proceeds as in the decimal number
system, and each trial divisor is subtracted from the number on the top
using binary subtraction.
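
The addition and multiplication rules can be sketched on bit strings as shown below. One assumption goes slightly beyond the two-digit rule stated in the list: once a carry is present, each position adds three bits (the two digits plus the incoming carry), so the sum digit is 1 when an odd number of them are 1. The function names are illustrative.

```python
def add_binary(a, b):
    """Add two unsigned binary strings digit by digit, propagating the carry."""
    width = max(len(a), len(b))
    a, b = a.zfill(width), b.zfill(width)
    carry = 0
    result = []
    for bit_a, bit_b in zip(reversed(a), reversed(b)):
        x, y = int(bit_a), int(bit_b)
        total = x ^ y ^ carry                    # sum digit: 1 for an odd number of 1 inputs
        carry = (x & y) | (carry & (x ^ y))      # carry out to the next digit
        result.append(str(total))
    if carry:
        result.append("1")
    return "".join(reversed(result))

def multiply_binary(multiplicand, multiplier):
    """Shift-and-add multiplication built from partial products."""
    product = "0"
    for shift, bit in enumerate(reversed(multiplier)):
        if bit == "1":                           # partial product: shifted copy of the multiplicand
            product = add_binary(product, multiplicand + "0" * shift)
    return product

print(add_binary("1011", "0110"), multiply_binary("1011", "101"))
```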

C2 LOGIC GATES, SYMBOLS, AND INTEGRATED CIRCUITS 

Digital-logic elements used in circuits have evolved over the years. The earliest
realizations of logic elements (logic gates) were relays with multiple contacts,
which were soon replaced by vacuum tube switches and vacuum tube diode
circuits. Vacuum tubes were in turn replaced by transistors and semiconductor
diodes, and, finally, by integrated circuits. Modern-day digital circuits (often
called chips) are composed of various integrated circuits; some, such as
microprocessors and memory systems, are quite complex, whereas others are
simple-logic circuits. We will discuss the simple-logic circuits since many more
complex circuits can be viewed as interconnections of the simple circuits. Such logic
circuits realize simple-logic functions such as union, intersection, complement,
and so forth (see Appendix A3 for a definition of these logic operations based
on set theory that applies to both digital logic and probability theory). The
inputs to the logic gates are called switching variables (represented by letters
A, B, x, y, etc.), and the output is a switching function, f(x, y). The union of
two switching variables A and B is written as f(A, B) = A + B (A ∪ B) and is
called an OR function; the associated logic gate is an OR gate. Similarly, an
intersection of A, B is written as f(A, B) = A · B (A ∩ B) and is called an AND
gate. The complement of A is given by $\overline{A}$ or A' and is called a NOT gate or an
inverter. Note the symbols $\overline{A}$ and A' are used interchangeably in this text. The
logic symbols for these gates are given in Fig. C1. These three logic gates as
well as the others given in Fig. C1 are discussed further in the next section.
The complement can be denoted by a NOT gate or by a small circle shown at
the output of the logic gate (cf. Fig. C2).

OR: f(A, B) = A + B
AND: f(A, B) = A · B
NAND: f(A, B) = (A · B)'
NOR: f(A, B) = (A + B)'
NOT: f(A) = A'
EXOR: f(A, B) = A ⊕ B

Figure C1 Logic functions and circuit symbols for common logic gates.

C3 BOOLEAN ALGEBRA AND SWITCHING FUNCTIONS 

One can define a logic function in terms of its variables, a mapping
connecting the values that the variables assume, and the resulting value of the logic
function. For example, if we have two switching variables, x and y, we can
write the general form of a two-variable switching function as f(x, y). This is
similar to the definition of a function in calculus; however, the variables x and
y are discrete and only take on the values 0 and 1. Thus we can define the
switching function mapping in terms of the 4 combinations (00, 01, 10, 11) of
the variables in tabular form. Such a table is called a truth table; truth tables
for the 6 functions given in Fig. C1 are shown in Table C2. The OR function
is 1 whenever x or y or both are 1; the AND function is 1 only when both x
and y are 1. The NAND function is the complement of the AND; the NOR
function is the complement of the OR. Although the EXOR function is like
an OR function, it excludes the case where both x and y are 1. The EXOR
function is 1 whenever the inputs x and y disagree. There is another function
that is sometimes defined as the complement of the exclusive OR function; this
is called the coincidence function, which is 1 whenever x and y agree. Note
that there is an alternate way to denote the NOR and NAND functions shown
in Fig. C2(a) and (b). A circuit for implementing an EXOR function and the
logic symbol are shown in Fig. C2(c).

In constructing the truth tables given in Table C2, we assumed that the
properties of the complement, union, and intersection of 1s and 0s given in Table
C3 hold. A more basic treatment [Hill, 1981] develops these relationships from
the principles of Boolean algebra. However, we will assume that the properties
of Table C3, as well as the basic Boolean algebra identities given in Table C4,
have been proven.
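
The truth tables for the two-input functions can be generated mechanically by evaluating each function on the four input combinations. The sketch below models each gate as a small function on the integers 0 and 1; the dictionary name `GATES` and the helper `truth_table` are illustrative.

```python
from itertools import product

# Each gate is modeled on the integers 0 and 1.
GATES = {
    "OR":   lambda x, y: x | y,
    "AND":  lambda x, y: x & y,
    "NAND": lambda x, y: 1 - (x & y),
    "NOR":  lambda x, y: 1 - (x | y),
    "EXOR": lambda x, y: x ^ y,
}

def truth_table(gate):
    """Return the output column for the inputs 00, 01, 10, 11, in that order."""
    return [gate(x, y) for x, y in product((0, 1), repeat=2)]

for name, gate in GATES.items():
    print(f"{name:5s}", truth_table(gate))
```

The coincidence function mentioned in the text would be obtained by complementing the EXOR column.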


TABLE C2 Truth Tables for the Six Functions in Fig. C1

NOT (Inverter) Function

x   f(x) = x'
0   1
1   0

OR, AND, NAND, NOR, EXOR Functions, f(x, y)

x   y   x + y   x · y   (x · y)'   (x + y)'   x ⊕ y
0   0   0       0       1          1          0
0   1   1       0       1          0          1
1   0   1       0       1          0          1
1   1   1       1       0          0          0

(a) NOR Gates: OR followed by NOT, z = (x + y)'; NOR symbol, z = (x + y)';
AND preceded by NOTs, z = x' · y'; equivalent NOR symbol, z = x' · y'.

(b) NAND Gates: AND followed by NOT, z = (x · y)'; NAND symbol;
equivalent NAND symbol.

(c) EXOR Gates: EXOR circuit composed of AND/OR/NOT gates; EXOR symbol.

Figure C2 Equivalent forms for logic functions; equivalent symbols/circuits for (a)
NOR gates, (b) NAND gates, and (c) EXOR gates.

TABLE C3 Properties of 1 and 0 in Boolean
Algebra

0' = 1     0 + 0 = 0     0 · 0 = 0
1' = 0     0 + 1 = 1     0 · 1 = 0
           1 + 0 = 1     1 · 0 = 0
           1 + 1 = 1     1 · 1 = 1

The identities given in Tables C3 and C4 can be used to manipulate Boolean
expressions. For example, consider the following expression:

$$\overline{x \cdot y \cdot z} = ? \qquad \text{let } a = x \cdot y \tag{C1}$$

Substituting a and applying identity {17} (one of DeMorgan’s laws),

$$\overline{a \cdot z} = \overline{a} + \overline{z}$$

Now, substituting again for a:

$$\overline{x \cdot y \cdot z} = \overline{x \cdot y} + \overline{z}$$

Again, applying identity {17}, we obtain

$$\overline{x \cdot y \cdot z} = \overline{x} + \overline{y} + \overline{z} \tag{C2}$$

Thus we have proved that one of DeMorgan’s laws applies to three variables.
(One can show that both of DeMorgan’s laws apply to n variables.)
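
For any particular number of variables, both of DeMorgan's laws can be confirmed by exhaustive evaluation. The sketch below does this; it is a brute-force check for a given n, not a proof for all n, and the function name is illustrative.

```python
from itertools import product

def demorgan_holds(n):
    """Check both DeMorgan laws for every assignment of n Boolean variables."""
    for values in product((0, 1), repeat=n):
        and_all = int(all(values))
        or_all = int(any(values))
        complements = [1 - v for v in values]
        if 1 - and_all != int(any(complements)):   # (x1 x2 ... xn)' = x1' + x2' + ... + xn'
            return False
        if 1 - or_all != int(all(complements)):    # (x1 + ... + xn)' = x1' x2' ... xn'
            return False
    return True

print([demorgan_holds(n) for n in range(2, 7)])
```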

Consider another example: We wish to simplify the expression, that is,
obtain an equivalent expression with fewer terms, fewer variables, or both.
Note that in the second form of Eq. (C3), the “dots” indicating multiplication
of Boolean variables have been omitted for brevity, as is usually done.

$$f(x, y, z) = x \cdot y \cdot z + x \cdot y \cdot \overline{z} + x \cdot z = xyz + xy\overline{z} + xz \tag{C3}$$

Applying identity {14} to the first two terms, one obtains

$$f(x, y, z) = xy(z + \overline{z}) + xz$$

From identity {5} and {6}, one obtains

$$f(x, y, z) = xy(1) + xz = xy + xz \tag{C4}$$

The result of our Boolean algebraic manipulation is that Eq. (C4) has two
terms rather than the three in the original function given in Eq. (C3), and both
terms in Eq. (C4) contain only two variables (often called literals). Thus the
manipulation has transformed the switching function into an equivalent simpler
form, which would result in a simpler circuit if one tries to build a
digital-circuit realization of this function (see Section C5).

Any switching function may be written in one of two standard (canonical)
forms: the sum-of-products (SOP) form and the product-of-sums (POS) form.
Either form holds for n variables, but for simplicity we will illustrate by
considering a switching function of three variables in SOP form. The standard
SOP form is as follows:

$$f(x, y, z) = [\overline{x}\,\overline{y}\,\overline{z} + \overline{x}\,\overline{y}z + \overline{x}y\overline{z} + \overline{x}yz + x\overline{y}\,\overline{z} + x\overline{y}z + xy\overline{z} + xyz] \tag{C5}$$

Various combinations of the 8 terms appear in the brackets. All possible
combinations of these 8 terms represent a three-variable function, which includes
the degenerate cases of no terms (a null circuit), all terms (always unity), the 8
functions that contain 1 term each, the 28 functions with 2 terms each, and so
on, for a total of 256 possible functions. As an example, consider the switching
function composed of first, second, and eighth terms in the bracket of Eq.
(C5):

$$f(x, y, z) = [\overline{x}\,\overline{y}\,\overline{z} + \overline{x}\,\overline{y}z + xyz] \tag{C6}$$

The number of different switching functions, N, of 3 variables can be computed
as the number of combinations of 8 terms taken 0 at a time plus the number of
combinations of 8 terms taken 1 at a time plus the number of combinations of
8 terms taken 2 at a time, and so on. One can show that the sum of this series is
given by $2^8 = 256$: Expand the binomial $(a + b)^n$ using the binomial expansion
and then let a = b = 1; the expression reduces to the series of combinations
discussed previously. In general, if there are k variables, there are $2^k$ terms
within the SOP bracket [cf. Eq. (C5)], and N is given by

$$N = 2^{2^k} \tag{C7}$$
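
The counting argument can be checked numerically by summing the combinations directly and comparing the total with the closed form. The function name below is illustrative.

```python
from math import comb

def count_switching_functions(k):
    """Count functions of k variables by summing combinations of the 2**k minterms."""
    terms = 2 ** k
    return sum(comb(terms, i) for i in range(terms + 1))

for k in (1, 2, 3, 4):
    print(k, count_switching_functions(k), 2 ** (2 ** k))
```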

The 8 terms inside the bracket in the SOP form are called minterms, and an
inspection of the example given in Eq. (C6) suggests a simplified form of
notation in terms of binary numbers:

$$f(x, y, z) = [\overline{x}\,\overline{y}\,\overline{z} + \overline{x}\,\overline{y}z + xyz] = [000 + 001 + 111] = \sum m(0, 1, 7) \tag{C8}$$

One would say that the switching function is in SOP form and contains
minterms 0, 1, 7, and one can write the SOP form directly from a truth table
by including minterms corresponding to each row for which the function is a
1. For example, the EXOR function given in Table C2 is given by

$$f(x, y) = [01 + 10] = \sum m(1, 2) = \overline{x}y + x\overline{y} \tag{C9}$$
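
Reading the SOP form off a truth table amounts to listing the rows where the function is 1 and writing one product term per row. The sketch below does this for an arbitrary function of n variables; the helper names are illustrative, and a trailing apostrophe is used to mark a complemented variable.

```python
from itertools import product

def minterms(func, n):
    """List the truth-table row numbers for which func evaluates to 1."""
    return [i for i, bits in enumerate(product((0, 1), repeat=n)) if func(*bits)]

def sop_expression(indices, names):
    """Write the SOP form; a trailing apostrophe marks a complemented variable."""
    terms = []
    for i in indices:
        bits = format(i, f"0{len(names)}b")
        terms.append("".join(v if b == "1" else v + "'" for v, b in zip(names, bits)))
    return " + ".join(terms)

exor_rows = minterms(lambda x, y: x ^ y, 2)
print(exor_rows, sop_expression(exor_rows, "xy"))
```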

The POS form is similar to the SOP form and is illustrated as follows for 3
variables:

$$f(x, y, z) = [(x + y + z) \cdot (x + y + \overline{z}) \cdot (x + \overline{y} + z) \cdot (x + \overline{y} + \overline{z})$$
$$\cdot\, (\overline{x} + y + z) \cdot (\overline{x} + y + \overline{z}) \cdot (\overline{x} + \overline{y} + z) \cdot (\overline{x} + \overline{y} + \overline{z})] \tag{C10}$$

As with the SOP form, various combinations of the 8 terms appear in the
brackets.

The number N is the same as with SOP, as given by Eq. (C7). The terms
in the bracket of Eq. (C10) are called maxterms; the notation is similar to that
illustrated in Eqs. (C8) and (C9), except that a capital M is used and that instead
of the summation symbol, a product symbol is used. One can write the POS
form from the truth table in a manner similar to that of the SOP, but a maxterm
is included for each row of the truth table where the function is a 0 and all
variables are complemented. As an example, consider the EXOR function of
Table C2, which is given in SOP form in Eq. (C9). The POS form is given by

$$g(x, y) = \prod M(0, 3) = (\overline{x} + \overline{y}) \cdot (x + y)$$

Complementing all the variables, we obtain the POS form:

$$g(x, y) = (x + y) \cdot (\overline{x} + \overline{y}) \tag{C11}$$

One can show that Eqs. (C9) and (C11) are the same by expanding Eq. (C11)
and simplifying

$$g(x, y) = (x + y) \cdot (\overline{x} + \overline{y}) = x\overline{x} + x\overline{y} + \overline{x}y + y\overline{y} \tag{C12}$$

The first and the last terms go to 0, and we have the same expression as Eq.
(C9).


C4 SWITCHING FUNCTION SIMPLIFICATION 
C4.1 Introduction 

Digital-logic design begins by formulating the switching function and then 
drawing a logic circuit that implements the design. Sometimes it is possible to 
write the logic function in an equivalent but simpler form to lead to a simpler 
logic circuit. 

The basis of logic simplification is the union of two logic terms that are
identical except that one contains a logic variable and the other contains
its complement. For example,

$$f(x, y, z) = xyz + xy\overline{z} = xy(z + \overline{z}) = (xy)(1) = xy \tag{C13}$$

One can describe an algebraic simplification process as successive applications
of the above simplification along with identities {5} and {8} of Table C4 to
simplify a logic function. However, it is easier to define a graphical process
called the Karnaugh map (K map) simplification.

C4.2 K Map Simplification 

The K map method is very useful in simplifying logic functions. It begins 
by constructing a “special matrix,” in which a pair of horizontal or vertical 
adjacent cells represents variable simplification; such a pair means that one 
variable drops out. The elementary logic terms (minterms) that make up the 
logic expression are entered in the map as ones (zeroes are entered in the 
other squares), and adjacencies that signify logic simplification are identified 
by inspection. For two variables, f(x, y), we use a square map as shown in 
Table C5. A similar rectangular map for three variables is shown in Table C6, 
and a larger square map for four variables is shown in Table C7. Note that in 
the three- and four-variable maps, the columns and rows are ordered 00, 01, 
11, 10 to provide the “touching” property, not 00, 01, 10, 11 as blind intuition 
might suggest to do. 
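
The 00, 01, 11, 10 ordering is the two-bit reflected Gray code, in which neighboring labels differ in exactly one bit. The sketch below generates that ordering and checks the "touching" property, including the wrap-around from the last label back to the first; the function names are illustrative.

```python
def gray_code(n_bits):
    """Return the reflected Gray code sequence as bit strings."""
    return [format(i ^ (i >> 1), f"0{n_bits}b") for i in range(2 ** n_bits)]

def differ_in_one_bit(a, b):
    """True when two equal-length labels differ in exactly one position."""
    return sum(p != q for p, q in zip(a, b)) == 1

def cyclically_adjacent(labels):
    """Check every neighboring pair, including the last label against the first."""
    return all(differ_in_one_bit(labels[i], labels[(i + 1) % len(labels)])
               for i in range(len(labels)))

print(gray_code(2), cyclically_adjacent(gray_code(2)))
print(cyclically_adjacent(["00", "01", "10", "11"]))
```

The natural binary order fails the check because the step from 01 to 10 changes two bits at once, so cells placed side by side in that order would not differ in a single variable.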

The way one proceeds with the K map method is to expand the function to
be minimized to include all variables; oftentimes, it is convenient to convert
to the “binary notation.” (The terms in the expanded function are generally
called minterms.) Consider the three examples given in Tables C8, C9,
and C10. Note the expansion, binary notation, and the shorthand notation in


TABLE C5 Two-Variable K Map, f(x, y)

Rules:

(a) Horizontal or vertical touching of two “1”
cells means that one variable drops out: the
one that appears as the variable and its
complement.

(b) All four cells are “1” cells, meaning both
(two) variables drop out; the function becomes
unity.

(c) Diagonal touching does not count.

TABLE C6 Three-Variable K Map, f(x, y, z)

Rules:

(a) Horizontal or vertical touching of two “1” cells
means that one variable drops out: the one that
appears as the variable and its complement.

(b) Four adjacent cells (four across or down, or four
in a square) mean two variables drop out; the
function becomes a single variable.

(c) Diagonal touching does not count.

(d) All eight cells mean three variables drop out; the
function becomes unity.

TABLE C7 Four-Variable K Map, f(w, x, y, z)

Rules:

(a) Horizontal or vertical touching of two “1” cells means
that one variable drops out: the one that appears as
the variable and its complement.

(b) Four adjacent cells (four across or down, or four in
a square) mean two variables drop out; the function
becomes two variables.

(c) Eight adjacent cells (two adjacent rows or columns of
four across or down) mean three variables drop out;
the function becomes a single variable.

(d) Diagonal touching does not count.

(e) All sixteen cells mean four variables drop out; the
function becomes unity.

TABLE C8 Two-Variable K Map Simplification, f(x, y)

Rules:

(a) f(x, y) = x' + xy expands to x'(y + y') + xy = x'y + x'y'
+ xy. In shorthand notation, the minterms become
00 + 01 + 11 = Σ m(0, 1, 3).
The three minterms all appear in the map as ones
in cells 0, 1, 3 (00, 01, 11).

(b) The two touching horizontal terms (00, 01) are circled
to show that they combine. The literal y drops out since
it appears as y and y' in the terms, yielding x'.

(c) The two touching vertical terms (01, 11) are circled to
show that they combine. The literal x drops out since
it appears as x and x' in the terms, yielding y.

(d) The simplified expression is x' + y. Note that the
minterm 01 was used twice in the simplification, which is
legitimate because f(x, y) = f(x, y) + f(x, y).


TABLE C9 Three-Variable K Map Simplification,/(x, y, z) 



Rules: 

(a) /(x, y, z) =000 + 001 + Oil + 101 + 111 =2 m(0, 1, 3, 
5, 7). Note the simplified minterm notation. 

(b) One always uses the largest groupings first. The four adja¬ 
cent cells in the center of the map are grouped, elimi¬ 
nating v and x and yielding z. This grouping is said to 
“cover” these four minterms. However, minterm 0 re¬ 
mains. One can take minterm 0 by itself, but further sim¬ 
plification occurs if we group 0 and 1, thereby eliminating 
z and yielding x'y'. 

(c) Simplified function/(x, y, z ) =z + x'y'. 
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TABLE CIO Four-Variable K Map Simplification, / ( w, x, y, z ) 



Rules: 

(a ) f(w, x, y, z) — 2 m{0, 1, 3, 5, 7, 8, 14). Note the simplified minterm 
notation. 

(b) As in the three-variable example, first the four adjacent cells (0001, 0011, 
0101, and 0111) are grouped, thereby eliminating y and x and yielding 
w'z- 

(c) We can group 0000 with 0001 as in the three-variable example, but a 
better move is to cover 1000 so that it touches only one other cell 
(0000). Thus grouping these two yields x’y'z ■ Note that the top and 
bottom edges of all maps “touch,” as do the right and left edges. This 
means the four-variable “square” is mappable on the surface of a torus, as 
are two- and three-variable K maps. 

(d) All ones are covered except for 1110. Unfortunately, this minterm does 
not touch any others (no diagonals are allowed) and must be included 
without simplification as wxyz ■ 

(e) The resulting function is /(w, x, y, z) = w'z = x'y'z + wxyz ■ 


terms of the “sigma notation” shown in the examples. For convenience, the
prime notation is sometimes used to represent complement. In the four-variable
example shown in Table C10, we can visualize the map as a surface, and
since the top edge of the map “touches” the bottom edge, we can view the map
as a cylinder. Furthermore, the left edge and right edge touch so the ends of the
cylinder are joined, forming a torus (a donut shape). In addition, the four corners
form a grouping, and sometimes there is more than one distinct grouping,
leading to different but equivalent groupings of the same complexity.
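
The simplified expressions of Tables C8, C9, and C10 can be checked by brute force: evaluate each expression for every input combination and list the minterm numbers for which it equals one. The following is a minimal sketch; the helper name minterms and the variables c8, c9, and c10 are illustrative choices, not part of the text.

```python
from itertools import product

def minterms(func, nvars):
    """Return the sorted minterm numbers for which func evaluates to 1."""
    result = []
    for bits in product((0, 1), repeat=nvars):
        if func(*bits):
            # The first variable is treated as the most significant bit.
            result.append(int("".join(map(str, bits)), 2))
    return result

# Table C8: f(x, y) = x' + y
c8 = minterms(lambda x, y: (not x) or y, 2)

# Table C9: f(x, y, z) = z + x'y'
c9 = minterms(lambda x, y, z: z or (not x and not y), 3)

# Table C10: f(w, x, y, z) = w'z + x'y'z' + wxyz'
c10 = minterms(
    lambda w, x, y, z: (not w and z)
    or (not x and not y and not z)
    or (w and x and y and not z),
    4,
)

print(c8)   # minterm list for Table C8
print(c9)   # minterm list for Table C9
print(c10)  # minterm list for Table C10
```

If the printed lists agree with the sigma notation given in each table, the simplified expression covers exactly the original minterms.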

A K map for five variables v, w, x, y, z can be viewed as two four-variable
maps, one suspended above the other, with v = 1 on the top plane and v = 0
on the bottom plane, where adjacency also holds for cells directly above and
below each other. Maps for six or more variables involve a “stack” of
four-variable maps, which becomes very complex. Fortunately, they are not often
needed, and another method, the Quine-McCluskey (QM) method, which involves
a series of tables, can be used. The QM method becomes complicated;
however, computer program implementations exist for large problems (see Hill
[1981], Kohavi [1978], Mano [1995], Roth [1995], Shiva [1988], and Wakerly
[2001]). The physical problem is sometimes such that a particular minterm or
minterms cannot exist; thus we do not care whether they exist or not. Such
terms, called don’t-cares, are entered as d in the K map and are treated as
ones if they help the simplification and as zeroes if they are of no aid. The
cells in the map that are not ones (minterms) are called maxterms; these are
generally labeled as zeroes. The zeroes can also be grouped to yield the
simplified function called the product-of-sums (POS) form mentioned previously.
The application of don’t-cares and the POS form are treated in the problems
at the end of this appendix and also in the references.
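
To give a flavor of the tabular approach, the following sketch implements only the first phase of a Quine-McCluskey style procedure: repeatedly merging terms that differ in one bit until no more merges are possible, which leaves the prime implicants. It is a simplified illustration under stated assumptions (no don't-cares, no covering step to pick a minimal subset), and the function names are hypothetical rather than taken from any of the cited references.

```python
from itertools import combinations

def combine(a, b):
    """Merge two implicants that differ in exactly one fixed bit, else None."""
    diff = [i for i, (p, q) in enumerate(zip(a, b)) if p != q]
    if len(diff) == 1 and "-" not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + "-" + a[i + 1:]
    return None

def prime_implicants(minterm_numbers, nvars):
    """Return the prime implicants as strings over '0', '1', and '-'."""
    current = {format(m, f"0{nvars}b") for m in minterm_numbers}
    primes = set()
    while current:
        merged, used = set(), set()
        for a, b in combinations(sorted(current), 2):
            c = combine(a, b)
            if c is not None:
                merged.add(c)
                used.update((a, b))
        # Terms that could not be merged any further are prime implicants.
        primes |= current - used
        current = merged
    return primes

# Three-variable example of Table C9: '--1' is z and '00-' is x'y'.
print(sorted(prime_implicants([0, 1, 3, 5, 7], 3)))
```

A dash marks a variable that has dropped out, so each string corresponds to one possible grouping in the K map. For the four-variable example of Table C10 the sketch also reports the grouping of 0000 with 0001, which the table chose not to use; selecting a minimal subset of prime implicants is the second phase of the method and is omitted here.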


C5 COMBINATORIAL CIRCUITS 
C5.1 Circuit Realizations: SOP 

Once the logic functions are minimized, the designer produces circuits that 
realize the logic function. The resulting minimizations from grouping the ones 
in the K map produce a union of intersection terms (in common engineering 
terms, a sum of products, or SOP). The product terms are developed using 
AND gates, and their outputs feed into an OR gate. If complements of the 
variables are needed as inputs to the AND gates, inverters are required. Thus 
any SOP form can be realized using only {AND, OR, NOT} gates; this set of
logic functions is called a complete set. There are other combinations of logic 
gates that also form a complete set (e.g., NAND gates or NOR gates). Three 
examples of SOP circuits are shown in Fig. C3. 

C5.2 Circuit Realizations: POS 

As was discussed in the previous section, one can group zeroes in the K map 
and obtain a POS design. The resulting circuit is similar to those in Fig. C3; 
however, instead of a multiple input OR preceded by a number of AND gates, 
the circuit is a group of OR gates followed by an AND gate. For examples, 
the reader is referred to the problems at the end of this appendix. 

C5.3 NAND and NOR Realizations 

Up until now, we have discussed the circuit realizations that all involved the 
complete set of {AND, OR, NOT} gates. Sometimes, it is more convenient
or simpler to deal with other types of logic gates. It turns out that two other 
complete sets of logic gates are often used: {NAND} and {NOR} gates. We 
state without proof (see the problems at the end of this appendix) that each of 
the AND and OR gates in the SOP form can be replaced with a NAND gate. 
Furthermore, if one does not wish to use an inverter, a two-input NAND gate 
with inputs tied together can suffice. Similarly, in a POS design, we can replace 
OR and AND gates by NOR gates and the inverter by a two-input NOR with 
inputs tied together. With a little more effort, one can also use NAND gates
for POS designs and NOR gates for SOP designs. For more details, see the
following references: Hill [1981], Kohavi [1978], Mano [1995], Roth [1995],
Shiva [1988], Wakerly [2001].

[Figure C3: Three examples of SOP circuits. (a) Example 1: f(x, y) = x' + y.
(b) Example 2: f(x, y, z) = x'y' + z. (c) A four-variable example with inputs
w, x, y, z and their complements w', x', y', z'.]
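
The NAND replacement described in Section C5.3 can be checked numerically. The sketch below compares an AND-OR-NOT realization of f(x, y, z) = x'y' + z with a NAND-only version for all eight input combinations. The function names are illustrative; note the assumption, which follows from De Morgan's law, that a single literal feeding the output gate directly (here z) must be inverted in the NAND-only circuit.

```python
from itertools import product

def nand(*inputs):
    """Multi-input NAND gate on 0/1 values."""
    return 0 if all(inputs) else 1

def inverter(a):
    """Inverter built from a two-input NAND with its inputs tied together."""
    return nand(a, a)

def f_and_or(x, y, z):
    """AND-OR-NOT realization of f = x'y' + z."""
    return int(((not x) and (not y)) or z)

def f_nand_only(x, y, z):
    """The same function with every gate replaced by a NAND."""
    product_term = nand(inverter(x), inverter(y))  # (x'y')'
    single_literal = inverter(z)                   # z'
    return nand(product_term, single_literal)      # ((x'y')' z')' = x'y' + z

all_equal = all(
    f_and_or(*bits) == f_nand_only(*bits) for bits in product((0, 1), repeat=3)
)
print(all_equal)
```

The same exhaustive comparison works for any small SOP function, which makes it a convenient check when converting a drawn circuit by hand.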

C5.4 EXOR 

The standard OR gate has an output equal to one whenever either of the inputs
is one or both inputs are one. An exclusive OR function (EXOR)
has an output equal to one whenever either of the inputs is one
but not when both inputs are one. The switching function can be written as
f(x, y) = xy' + x'y. This function has a special logic symbol written as f(x, y) =
x ⊕ y. The EXOR function occurs frequently in coding theory and in other
application areas. It is easy to recognize in a K map by its “checkerboard”
pattern of ones and zeroes. In shorthand notation, f(x, y) = 10 + 01 = Σ m(1,
2); for three and four variables, f(x, y, z) = xy'z' + x'yz' + x'y'z + xyz = 100 + 010
+ 001 + 111 = Σ m(1, 2, 4, 7); f(w, x, y, z) = wx'y'z' + w'xy'z' + w'x'yz' +
w'x'y'z + wxyz' + wxy'z + wx'yz + w'xyz = 1000 + 0100 + 0010 + 0001 + 1110
+ 1101 + 1011 + 0111 = Σ m(1, 2, 4, 7, 8, 11, 13, 14). Note that one of the
properties of the EXOR function is that in the binary notation, each minterm
has an odd number of ones. The function is used extensively in Chapter 2 on
coding.
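
The odd-number-of-ones property gives a quick way to generate the EXOR minterm lists for any number of variables. The following is a small sketch; the function name exor_minterms is an illustrative choice.

```python
from functools import reduce
from operator import xor

def exor_minterms(nvars):
    """Minterm numbers for which the EXOR of all nvars input bits equals 1."""
    result = []
    for m in range(2 ** nvars):
        bits = [(m >> i) & 1 for i in range(nvars)]
        if reduce(xor, bits) == 1:
            result.append(m)
    return result

for n in (2, 3, 4):
    terms = exor_minterms(n)
    # Every EXOR minterm has an odd number of ones in binary notation.
    assert all(bin(m).count("1") % 2 == 1 for m in terms)
    print(n, terms)
```

Exactly half of the 2^n cells are ones, which is what produces the checkerboard pattern in the K map.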


C5.5 IC Chips 

Several courses in electrical engineering curricula study in detail the features of
integrated circuits (also called ICs or “chips”); however, for our purposes, we
need to know only a few facts. Integrated circuits come in a wide variety of device
types or families that vary in switching speed, power consumption, immunity
to noise, cost, and other factors. We can illustrate some of the differences
by focusing on two families: the transistor-transistor logic (TTL) family,
which is the most common and least expensive family, and the complementary
metal oxide silicon (CMOS, or “sea moss”) family, which is used extensively
for low-power, portable (i.e., battery- and solar cell-powered) applications such
as calculators. Switching delays range from about 3 to 20 nanoseconds
(billionths of a second), quiescent power dissipation ranges from about 0.0025
to 10 milliwatts, and cost, depending on the complexity of the circuit and
the quantity purchased, ranges from 10¢ to $2. Within each logic family
there are subcategories such as the low-power Schottky TTL (LSTTL) subfamily,
whose circuits have a lower-than-normal power usage, and fast Schottky
TTL (FAST TTL) circuits that switch faster than regular TTL circuits. The
reader should consult a recent Motorola or Texas Instruments databook and the
current state of technology for more details.

The kinds of available logic gate packages depend strongly on the number of
pins in the package. The simpler ICs come in standard 14- or 16-pin packages
approximately 20 x 6 x 5 mm with 7 or 8 pins 5 mm long on each side. A
typical 14-pin package has 2 pins devoted to power (typically, Vcc = 5 volts
and ground, which are generally pins 14 and 7); thus 12 pins are available for
input and output signals. For an inverter, there is 1 input and 1 output pin;
thus 6 devices come in a standard package, which is called a HEX inverter.
For a two-input gate (AND, OR, NAND, NOR), 2 input pins and 1 output
pin are required; thus 4 gates can be placed in a package, which is called a
QUAD two-input gate. Similarly, a three-input gate has 3 gates per package
and is called a TRIPLE three-input gate. A DUAL four-input gate has 2 gates
per package (with 2 unused pins). The largest gate in a standard-size package
is a single 13-input gate. See Table C11 for typical TTL gates.
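
The pin arithmetic above can be written as a one-line calculation: subtract the power pins, then divide the remaining signal pins by the number of pins each gate needs. The sketch below assumes single-output gates and two power pins; the function name and the 16-pin assumption for the 13-input gate are illustrative.

```python
def gates_per_package(inputs_per_gate, pins=14, power_pins=2):
    """Return (gate count, unused signal pins) for single-output gates."""
    signal_pins = pins - power_pins
    pins_per_gate = inputs_per_gate + 1  # the inputs plus one output
    return divmod(signal_pins, pins_per_gate)

print(gates_per_package(1))            # inverter in a 14-pin package
print(gates_per_package(2))            # two-input gate
print(gates_per_package(3))            # three-input gate
print(gates_per_package(4))            # four-input gate
print(gates_per_package(13, pins=16))  # 13-input gate, assuming a 16-pin package
```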

Of course, complex integrated circuits such as memories and microprocessors
may come in large packages and have 50-100 pins. One can consult an
Intel databook for typical examples of the present state of the art of logic for
larger ICs.


TABLE C11 Some Examples of Typical TTL Logic Gates.
[Reprinted with permission of ON Semiconductor; Motorola, 1992.]

Gate type                     Device
Hex Inverter                  MC54/74F04
Quad 2-Input AND Gate         MC54/74F08
Quad 2-Input OR Gate          MC54/74F32
Quad 2-Input NAND Gate        MC54/74F00
Quad 2-Input NOR Gate         MC54/74F02
Triple 3-Input NAND Gate      MC54/74F10
Dual 4-Input AND Gate         MC54/74F21
13-Input NAND Gate            SN54/74LS133


C6 COMMON CIRCUITS: PARITY-BIT GENERATORS AND
DECODERS

C6.1 Introduction 

The short discussion of digital integrated circuits in the preceding section may
have left the reader with the notion that there are only small IC packages,
such as QUAD two-input AND gates (called small-scale integration, or SSI),
and large memory and microprocessor chips (called large-scale integrated
circuits, or LSI, or very large scale integrated circuits, or VLSI). Such is not
the case, however, and IC designers have been active for several decades
producing medium-scale integrated (MSI) circuits. The MSI devices are available
for a large range of practical functions that could be built out of smaller ICs
but at a greater cost and size. In essence, it is easier and cheaper to do the
wiring and constructing on the IC chip rather than externally. Formally, we can
classify the scale of ICs in terms of the number of gates in their equivalent
circuits: 1 < SSI < 20; 20 < MSI < 200; 200 < LSI < 200,000; and VLSI > 200,000
(some say VLSI > 500,000). We will discuss two MSI devices: a parity-bit generator
and a decoder, both of which were used in Chapter 2.

C6.2 A Parity-Bit Generator

In Chapter 2, we discussed the use of a parity-bit code to help detect simple
errors in transmission of digital words. The most common scheme is to add a
check bit to the word so that all the words have an odd number of ones. After
transmission, we can count the number of ones in the word; if the count is
even, we know that one error has occurred (actually, an odd number of errors),
and we signal “transmission error.” Sometimes we set the parity bit so that
the number of ones in the word is an even number; we call this even parity.
From our discussion in Section C5.4, we see that EXOR gates could be used to
accomplish both the generation of a parity bit and the checking of parity for the
transmitted word. Rather than use a group of AND gates (as shown in Fig. 2.1)
or a “tree” of EXOR gates (shown in Fig. 2.2), one can use an MSI device: the
SN74180, a 9-bit odd/even parity generator/checker (see Fig. 2.4). This circuit
and the similar (newer and faster) 74LS280 shown in Fig. 2.7 and repeated in
Fig. C4 can be used for parity-bit generation or checking for up to 9
bits. By studying Fig. C4(a) and (b), one can see that inputs 8-13 and 1, 2, 4
(note that pin 3 is not used; this is denoted as NC = no connection) labeled A,
B, C, D, E, F, G, H, I are used for up to nine inputs. Outputs 5 and 6 yield
the EXOR function of all the inputs and its complement; these are labeled by
equivalent wording of even or odd parity. Suppose that one wishes to check
the parity of a 16-bit word (labeled as bits 0-15) in a computer circuit. One
can use two 74LS280 chips in cascade. Bits 0-7 go into inputs A-H of chip
1 and bits 8-15 go into inputs A-H of chip 2. The output of chip 1 goes into
input I of chip 2. When an input is not used (e.g., input I of chip 1), it can be
left unconnected in some logic families, but in others it serves as an “antenna”
that picks up stray inputs that may interfere with operation. The safest course
of operation is to connect an unused input to +5 volts or ground, depending on
the logic family and the function of the input. Unused inputs for a 74LS280
are connected to ground. The details of Fig. C4(b) are best left to an electrical
engineer who has studied IC design.
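
The two-chip cascade can be modeled behaviorally to confirm that it computes the parity of the full 16-bit word. The sketch below is only a functional model under stated assumptions: each chip is reduced to the EXOR of its nine inputs and the complement, the "odd" output of chip 1 is the one fed to input I of chip 2, and the function names are hypothetical. It does not model pin numbers or timing.

```python
def parity_chip(inputs):
    """Model of a 9-input parity tree: return (even_out, odd_out).

    odd_out is the EXOR of all nine inputs (1 when the number of ones is
    odd); even_out is its complement.
    """
    assert len(inputs) == 9
    odd_out = sum(inputs) % 2
    return 1 - odd_out, odd_out

def word_has_odd_ones(word16):
    """Cascade two chips to test a 16-bit word for an odd number of ones."""
    bits = [(word16 >> i) & 1 for i in range(16)]
    # Chip 1: bits 0-7 on inputs A-H; unused input I is tied to ground (0).
    _, odd1 = parity_chip(bits[0:8] + [0])
    # Chip 2: bits 8-15 on inputs A-H; chip 1's output goes to input I.
    _, odd2 = parity_chip(bits[8:16] + [odd1])
    return odd2

print(word_has_odd_ones(0x8001))  # two ones in the word
print(word_has_odd_ones(0x8000))  # one one in the word
```

Tying the unused input to 0 matters in the model just as it does in hardware: a floating or high input I on chip 1 would invert the result.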


C6.3 A Decoder 

Another MSI circuit used in Chapter 2 was a decoder. A decoder (sometimes
called a demultiplexer) converts the binary representation of n variables into
2^n outputs. One can liken the functional operation of the decoder to a selector
switch such as the one found on the ventilation systems in many automobiles:
Rotation of the knob switches from “off” to “interior air circulation” to “vent
air conditioner” to “floor air conditioner” to “outside air vent” and so on. In
a decoder, the binary input is analogous to the number of clicks of switch
rotation; which one of the 2^n outputs is selected is analogous to the air
circulation function selected.

Figure C4 A 74LS280 9-bit odd/even parity generator/checker (MC54/74F280):
(a) connection diagram; (b) logic diagram. [Reprinted with permission of ON
Semiconductor; Motorola, 1992.]

Figure C5 A 74LS138A 1-of-8 decoder/demultiplexer (MC54/74F138; also called
a 3-to-8 decoder): (a) connection diagram (DIP, top view), logic diagram, and
logic symbol, with Vcc = pin 16 and GND = pin 8; (b) functional block diagram.
[Reprinted with permission of ON Semiconductor; Motorola, 1992.]

For example, the 74LS138 3-to-8 decoder (shown in Fig. 2.7 and repeated
in Fig. C5) converts the three digits of the input variable A2 A1 A0 (pins 1, 2,
3) into the eight possible output combinations (pins 15-9 and 7): 000 = A2'A1'A0',
001 = A2'A1'A0, ..., 111 = A2A1A0, which represent the outputs O0 to O7. The
small circle at
the outputs in the logic diagram indicates that this is the complement of the
desired output. It is fairly common for integrated circuits to produce the desired
output, its complement, or both signals. Similarly, the inputs to various
integrated circuits may call for the desired signal or its complement. The designer
keeps track of which signals are needed, alternates complementary outputs and
inputs (which cancel), or occasionally uses inverters where needed. There is an
additional set of three inputs (4, 5, 6; G1, G2A, G2B), which are called enable
inputs. These inputs are designed for expansion of certain ICs in groups to
work with larger combinations of inputs and outputs and also serve to connect
or disconnect all the inputs or outputs. (Note that the output of the enable AND
gate is shown in the logic diagram to be an input to all the output gates so
it can switch all the outputs on or off.) For example, one can design a 4-to-16
decoder by using two 74LS138 3-to-8 decoders. One bank of 8 outputs is
handled by the first decoder; the other bank of 8 outputs is handled by the
second decoder. The enable inputs are configured so that they switch on or off
the appropriate bank. Thus, by feeding the extra variable (the fourth input) into
the G1 input of one decoder and the same variable into the G2A input of the
other decoder, which effectively complements it, the extra variable switches
between the two banks. The enable input AND gate has an output if G1 = 1,
G2A = 0, and G2B = 0. Thus, if any enable input is not used, it must be
connected to 1 (5 volts) if it is G1 or connected to 0 (grounded) if it is the
G2A or G2B input. In Fig. 2.7, we only need a 3-to-8 decoder; thus none of the
enable inputs are used, and G1 is connected to +5 volts and G2A, G2B are
grounded (0 volts). Note that the G1 connection to +5 volts also uses a resistor
to limit the current into the input to protect it when switching occurs and that
the decoder requires a 16-pin package.
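
The bank-switching idea can be illustrated with a behavioral model. The sketch below assumes active-low outputs (the small circles in the logic diagram) and the enable condition G1 = 1, G2A = 0, G2B = 0 stated above; the function names and the choice of which decoder serves the low bank are illustrative assumptions, not a wiring diagram for any particular part.

```python
def decoder_3to8(a2, a1, a0, g1=1, g2a=0, g2b=0):
    """Model of a 3-to-8 decoder with active-low outputs O0..O7.

    The chip is enabled only when G1 = 1, G2A = 0, and G2B = 0; otherwise
    all outputs stay at 1 (inactive).
    """
    enabled = g1 == 1 and g2a == 0 and g2b == 0
    selected = 4 * a2 + 2 * a1 + a0
    return [0 if enabled and i == selected else 1 for i in range(8)]

def decoder_4to16(a3, a2, a1, a0):
    """Two 3-to-8 decoders; the fourth input A3 selects the bank."""
    # Low bank (outputs 0-7): A3 drives G2A, so A3 = 0 enables this chip.
    low = decoder_3to8(a2, a1, a0, g1=1, g2a=a3, g2b=0)
    # High bank (outputs 8-15): A3 drives G1, so A3 = 1 enables this chip.
    high = decoder_3to8(a2, a1, a0, g1=a3, g2a=0, g2b=0)
    return low + high

outputs = decoder_4to16(1, 0, 0, 1)  # input 1001 (decimal 9)
print(outputs.index(0))              # position of the single active-low output
```

Because G2A responds to a 0 and G1 responds to a 1, the same signal A3 enables exactly one chip at a time and no external inverter is needed in this model.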


C7 FLIP-FLOPS 

Computers and digital circuits are composed of three broad classes of
facilities: the computational units called central processing units (CPUs)
(frequently microprocessors); the memory units (flip-flops, electronic memory,
and disk, card, tape, CD, and other media); and input/output facilities
(keyboards, monitors, communication connections, etc.). The fastest of the
memory storage units are flip-flops (FFs), which are individually connected or
connected in banks called registers. Registers, discussed briefly in Chapter 2,
are storage devices made up of cascaded single-bit FF storage circuits with
switching time delays of several nanoseconds (about the same as logic gates).

In addition, there are single-input FFs; among them is the trigger or toggle
FF (T FF). When the input T is 0, the output Q (and its complement Q') holds
(stores) its previous state (either a 0 or 1). When T = 1, the values of Q and Q'
flip from their previous states to the complement (0 or 1). There is also a delay
FF (D FF), which stores as output whatever the D input is after a switching
delay. Both the T and D FF are single-input devices. A symbolic diagram of
a T FF is shown in Fig. C6 along with a state table. A state table is similar to
a truth table, but it contains an additional column (serving like an additional
input) that is the previous output. Note the first line of the state table reveals
that if the previous storage state Q_n is 0 and there is no T input (0 input), then
the new state Q_{n+1} is the same as the old, that is, 0. The second line in the
state table also reveals that with no input, the stored state does not change and the
stored value of 1 remains stored. In the last two lines of the state table, the input
of 1 changes the stored state from 0 to 1 or 1 to 0. In a large circuit with many
interconnected FFs, there may be unwanted signals that propagate after one or
more FFs switch. It is desirable to have all the FFs synchronized so that they
initiate switching simultaneously in a short instant of time less than the switching
time. The C input (clock input) accomplishes this synchronization. A timing
signal, which is a pulse of short duration (less than the switching time), is fed to the
C input that “opens” the T input and allows it to switch the FF if T = 1. (The inputs
T and C are essentially fed to a two-input AND gate, and the output of the gate is
the switching signal.) Other synchronization circuitry and features for clearing a
stored value or setting a stored value (reset to 0 or preset to unity) are included
in commercial FFs but need not be discussed here.

Figure C6 Diagrams of various flip-flops: (a) a T FF; (b) a 2-input JK FF.
[Reprinted with permission of ON Semiconductor; Motorola, 1992.]

Figure C6 (Continued): (c) a 74LS73A JK FF (SN54/74LS73A), showing the logic
diagram (each flip-flop) and the logic symbol; Vcc = pin 4, GND = pin 11.

The second part of Fig. C6 shows a two-input JK FF. A 1 signal on the J
and a 0 signal on the K input set the output Q to 1 regardless of its previous
storage value (see rows 3 and 6 in the state table). A 1 signal on the K and a
0 signal on the J input set the output Q to 0 regardless of its previous storage
value (see rows 2 and 5 in the state table). As a way to remember the function,
one can think of J as “jump” and K as “kill.” If both inputs J and K have a
0 signal, then nothing changes in the output (see rows 1 and 4 in the state
table). If both the J and K inputs are simultaneously 1, then the FF behaves
like a T FF, that is, the stored state flips its value (see rows 7 and 8 in the
state table). There is also another kind of two-input FF called a reset (R) and
set (S) FF (or a reset-set FF) that behaves like a JK FF, except the S = R =
1 condition is not allowed. (For more details, see the references [Hill, 1981;
Kohavi, 1978; Mano, 1995; Roth, 1995; Shiva, 1988; Wakerly, 2001].) Some
designers consider a JK FF a basic design element since it is easily connected
to behave like a T or D FF. See the problems at the end of this appendix for
more details.
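
The JK behavior described above can be summarized by the standard characteristic equation Q_{n+1} = J Q_n' + K' Q_n, which is not stated in the text but reproduces each case (set, reset, hold, toggle). The sketch below tabulates it and shows one common way of deriving T and D behavior from it; the function names are illustrative, clocking is ignored, and the row order printed here is not necessarily the row order of the state table in Fig. C6.

```python
def jk_next(q, j, k):
    """Next state of a JK flip-flop: Q_{n+1} = J Q_n' + K' Q_n."""
    return (j & (1 - q)) | ((1 - k) & q)

def t_next(q, t):
    """T flip-flop behavior obtained by tying J and K together (J = K = T)."""
    return jk_next(q, t, t)

def d_next(q, d):
    """D flip-flop behavior obtained with J = D and K = D' (one inverter)."""
    return jk_next(q, d, 1 - d)

# Print the full JK state table: previous state, inputs, next state.
for q in (0, 1):
    for j in (0, 1):
        for k in (0, 1):
            print(q, j, k, "->", jk_next(q, j, k))
```

With J = K = T the equation reduces to a toggle when T = 1 and a hold when T = 0; with K = J' the next state simply equals J, which is the delay behavior.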

The 74LS73A JK FF shown in Fig. C6(c) is essentially the same as the JK 
FF just discussed, but with the following modifications: (a) It is dual—that is, 
two devices are fit into a 14-pin package; (b) it is negative edge-triggered—that 
is, the clock-pulse input opens the J and K inputs when the pulse falls from 1 
to 0; and (c) a 0 signal on the CD-input sets (clears) the Q output to 0 (and, 
of course, the Q' output becomes 1). 

A simple application of a T FF is in a digital-electronic elevator control
system. When you are outside the elevator and push the up or down button
to call the elevator to a floor and then release the button, the signal remains
stored. An implementation is to connect a power source and a switch in series
and feed the signal to the T input of a T FF (and also into the C input). The
call signal will be stored. One FF is needed for the up and one for the down
inputs. Once the elevator reaches your floor, a floor switch can feed another
voltage into the T input to switch the call state back to 0. Actually, the circuit
must know if the elevator is traveling up or down to know which call signal
to clear (this would require another T FF storage element).


C8 STORAGE REGISTERS 

Flip-flops can serve as single-bit storage registers; however, they are generally
organized inside of MSI circuits called registers, such as the 74F195 device (a
4-bit shift/storage register in a 16-pin package; see Fig. C7). Up to four bits
of data can be stored in the register, and there is also a shift-right function. If we
wish to store a 16-bit word, we use a cascade of four such devices. Other
storage registers provide more bits of storage (in packages with more pins), both
shift-right and shift-left operation, and other functions. Generic block diagrams
of storage/shift registers are shown in Figs. 2.10 and 2.11.

At the heart of this storage register are four reset-set (RS) FFs that behave
like JK FFs, where R is like K and S is like J. The clock-pulse (CP) input
is for a clock pulse that feeds all four FFs. The MR' input is used to reset
(clear) all the FF outputs (Q0, Q1, Q2, Q3) instantly to 0. For convenience,
the complement of the Q3 output is supplied. Parallel input of data is provided
(as with most shift/storage registers) via inputs D0, D1, D2, D3. We can liken
parallel loading of the register to four soldiers facing four adjacent doors of
a barracks (D0, D1, D2, D3). When the corporal beats the drum (the CP), the
four doors open and the soldiers enter the barracks and stand to attention inside
(Q0, Q1, Q2, Q3), after which the doors close. The analogy for serial loading
is that the four soldiers are in single file facing a single door (J, K'). When the
corporal beats the drum (the CP and the shift pulse, or PE), the single door
opens, and the first soldier enters the barracks and stands to attention inside
(Q0). As the sound of the drum dissipates, the door closes and the first soldier
comes to attention (Q0). At the next drumbeat, the soldier inside makes a right
turn, takes a step forward, makes a left turn, and stands to attention (the soldier
shifts right from position Q0 to Q1); then the door opens and the second soldier
steps inside (Q0). As the sound of the drum dissipates, the door closes and the
second soldier stands to attention (Q0) next to the first who is already standing
to attention (Q1). The process repeats itself for two more drumbeats until all
four soldiers are in the barracks standing to attention. If there is an attack,
the bugler sounds the alarm (MR'), causing all four soldiers to run from the
barracks to arm themselves and leave the barracks empty (reset to 0).
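
The parallel-load, serial-shift, and reset operations in the barracks analogy can be sketched as a small behavioral model. This is only an illustration under simplifying assumptions: the J and K' serial inputs are collapsed into a single serial_in bit, clock edges are represented by method calls, and the class and method names are hypothetical rather than taken from the device data sheet.

```python
class ShiftRegister4:
    """Behavioral model of a 4-bit shift/storage register (Q0..Q3)."""

    def __init__(self):
        self.q = [0, 0, 0, 0]

    def master_reset(self):
        """MR': clear all four outputs to 0 immediately."""
        self.q = [0, 0, 0, 0]

    def clock_parallel(self, d0, d1, d2, d3):
        """On a clock pulse, load D0..D3 into Q0..Q3 all at once."""
        self.q = [d0, d1, d2, d3]

    def clock_shift(self, serial_in):
        """On a clock pulse, shift right (Q0->Q1->Q2->Q3); new bit enters Q0."""
        self.q = [serial_in] + self.q[:3]

reg = ShiftRegister4()
for bit in (1, 0, 1, 1):   # four drumbeats, one soldier per beat
    reg.clock_shift(bit)
print(reg.q)               # the first bit shifted in now sits in Q3
reg.master_reset()
print(reg.q)
```

Parallel loading takes one clock pulse for all four bits, whereas serial loading takes four pulses; that trade-off between pin count and load time is the main design difference between the two modes.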

Applications of storage/shift registers are shown in Figs. 2.9, 2.10, and 2.11;
they are also discussed in the problems at the end of this appendix. Some
shift/storage registers provide additional facilities, such as both shift-right and
shift-left capability. The reader is referred to the following references for more
details: Motorola, 1992; Shiva, 1988; and Wakerly, 2001.

Figure C7 Diagrams of a 74F195 shift/storage register. [Reprinted with permission
of ON Semiconductor; Motorola, 1992.]


PROBLEMS

C1. Convert the following base-2 number to base 10, base 8, base 16:
(1011110101)₂.

C2. As a check, convert the base-10, base-8, and base-16 numbers obtained
back to base 2.

C3. Expand Table C1 for numbers between 20 and 30.

C4. Prove the identities of Table C4 by substituting all possible combinations
of ones and zeroes for the variables and computing the results.

C5. Add the following base-2 numbers and check the results by converting
to base 10: (1011110101)₂ + (11110101)₂ = ?

C6. Repeat problem C5 for the subtraction problem: (1011110101)₂ -
(11110101)₂ = ?

C7. Repeat the problems in Tables C8-C10 using the POS form. To use
the POS form, draw a K map and enter ones for the maxterms (note
the minterms are ones in the SOP map and the maxterms are zeroes).
Proceed to minimize the K map as with the SOP form and write the
minimum function in SOP form. Then complement all variables.

C8. Which method is better for the problems of Tables C8-C10: the SOP
method shown or the POS method of problem C7?

C9. Consider the three examples that are each given in Tables C8-C10.
Assume that you can replace one of the zeroes in each K map. For each
problem in the tables, choose the best location to insert a don’t-care (d)
and decide whether it should become a 1 or a 0 for best minimization.
Explain your reasoning. Minimize the function and draw an SOP circuit
for the minimized function.

C10. Repeat problem C9 for a POS design. (Hint: Study the solution of
problem C7 to learn how to do a POS design.)

C11. Prove that the NAND gate is a complete set. (Hint: Since we know that
the set of AND, OR, NOT gates is a complete set, we can perform our
proof by showing that we can construct an AND, an OR, and a NOT
gate from one or more NAND gates.)

C12. Repeat problem C11 and show that a NOR gate is a complete set.

C13. Draw a circuit for the examples of Tables C8-C10 in the SOP form
using only NAND gates.

C14. Draw a circuit for the examples of Tables C8-C10 in the POS form
using only NOR gates.

C15. Consider the four-variable K map given in Table C7. Fill the entire map
with a “checkerboard pattern” of ones and zeroes, starting with a 0 in
the top left-hand corner. Minimize the function and draw an SOP circuit
using AND, OR, NOT gates. Repeat the circuit using EXOR gates; then
compare the two circuits.

C16. Repeat problem C15 for a “checkerboard pattern,” starting with a 1 in
the top left-hand corner.

C17. Draw a diagram to show how a 74LS280 IC can be used to check an
8-bit word for even parity and for odd parity.

C18. Repeat problem C17 for a 16-bit word.

C19. Show how to connect two 74LS138 3-to-8 decoders to implement a
4-to-16 decoder.

C20. Start with a JK FF and show how the two inputs can be connected to
operate like a T FF. Explain.

C21. Start with a JK FF and show how the two inputs can be connected to
operate like a D FF. (Hint: You will need an inverter to take the
complement of one of the inputs.) What will the delay time of the D FF
be?

C22. Fill in the details of the elevator-floor-button-control system outlined in 
Section C7 and draw the circuit diagram. Explain the operation. 

C23. Use a 74F195 storage/shift register to design the storage application 
shown in Fig. 2.9. Explain how to connect the inputs and outputs of 
the 74F195. 

C24. Repeat problem C23 for the application shown in Fig. 2.10. 

C25. Repeat problem C23 for the application shown in Fig. 2.11. 



APPENDIX D 


PROGRAMS FOR RELIABILITY 
MODELING AND ANALYSIS 


D1 INTRODUCTION 

Analysis is the theme of this book; indeed, Chapters 1-7 stressed both exact 
and approximate analytical approaches. However, it is clear that a large,
practical system will involve computer analysis. Thus the focus of this appendix
is to briefly discuss a sampling of the many available computer programs and 
point the reader to references that discuss such programs in more detail. In all 
cases, the intent is to provide a smooth transition from analysis to computation. 
Analysis programs are important for many reasons, including the following: 

1. to replace laborious computation; 

2. to model complex effects that are difficult to solve analytically; 

3. to solve a system that is so large or complex that it is intractable, even 
with approximations and simplifications; 

4. to provide a graphic- and text-based record of the system model under 
consideration; and 

5. to document parameter values and computations based on the model. 

The reader should not underestimate the utility of a well-thought-out computer 
program to aid in documentation to satisfy reasons (4) and (5) of the preceding 
list. A program might be used for documentation even if all the computations 
are done analytically. 

In the early days of reliability analysis, the size of a computational program,
the speed of the computer, and the size of memory were of prime importance.
To put this in perspective, the reader should consider the title, subtitle, and
publication date of the following article written by Schmidt and Busch, two
engineers with the GE Controls Department, in The Electronic Engineer: “An
Electronic Digital Slide Rule: If This Hand-Sized Calculator Ever Becomes
Commercial, the Conventional Slide Rule Will Become Another Museum Piece”
[1968]. The authors were speaking of the forerunner of the scientific calculator
now available in any stationery store or drugstore for $12 (as low as $8 during
fall back-to-school sales). The first pocket-sized commercial scientific
calculator, introduced in the early 1970s, was the Hewlett-Packard HP-35; it sold
for about $400! To the author’s knowledge, the first comprehensive reliability
computation program (one that models repair via solution of the analytical
equations) was the GEM Markov modeling program developed by the Naval
Applied Sciences Lab in the late 1960s [Orbach, 1968]. This program ran on the
CDC 6600, the supercomputer of the day, and complex problem solutions took
one-half to one hour.

Computer solutions of reliability or availability models fall into a number of 
major classes. All the approaches to be discussed in this appendix are practical 
because of the great power and memory size of modern desktop and laptop 
computers. The main choice hinges on ease of use and cost of the program. 
The least desirable approach is to write a custom analysis program in some 
modern version of a standard computer language, such as C or FORTRAN. 
This certainly works, but unless someone within your organization has already 
developed such a program, the overhead cost is too great. 

The next choice is to formulate a set of equations for the analysis and use
one of the standard mathematical solution tools such as Mathematica [1999],
Mathcad [1995], Matlab [1992], Macsyma [Ralston, 1976], or Maple [Ellis,
1992] to solve the equations. All these systems are powerful and can solve
logic equations for combinatorial reliability or availability expressions or
differential equations for Markov model solutions. The choice should be based on
cost, ease of use, familiarity, availability within your organization,
availability of “readable” manuals describing how the program is used in
reliability or availability modeling, and other practical factors.

Another class of reliability analysis programs uses Monte Carlo solution (see
Rubinstein [1981] and Shooman [1990]). Such a simulation approach is very
flexible and allows one to model highly complex behaviors. However, solution
requires the generation of random values for the times to failure, times to
repair, and other parameters for each “run” of the simulation program. The
program must be run N times, and the probabilities are estimated as the
number of favorable outcomes divided by N. As N → ∞, these ratios approach
the true probabilities. The main limitation of such an approach is the size of
N required and how long one must wait for the N simulations to run. At one
time, simulation required supercomputers and long running times. The method,
invented by von Neumann and Ulam, was used initially to solve complex nuclear
calculations during the Manhattan Project at Los Alamos Laboratories in New
Mexico and went under the code name “Monte Carlo” for secrecy. (Of course,
Monte Carlo evoked the image of the games of chance in the casinos of the
famous city in Monaco.) Many simulation programs are written in the language
SIMSCRIPT (and its successors), developed by the RAND Corporation in the early
1960s and implemented on early IBM computers [Sammet, 1969, p. 657]. We again
comment that the power and speed of modern desktop and laptop computers make
the Monte Carlo method practical for many solutions that previously required
prohibitive running times. For further details, see Shooman [1990, Section 5.10.4].
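
The estimation procedure just described can be sketched in a few lines of Python. The sketch below is only an illustration under stated assumptions: the system (two identical, independent elements in parallel with constant failure rates), the failure rate, and the function names are all hypothetical choices made here, not taken from any program discussed in this appendix. Each run draws random times to failure, and the reliability estimate is the ratio of favorable outcomes to N.

```python
import math
import random


def simulate_parallel_reliability(lam, t, n_runs, seed=1):
    """Estimate R(t) for two independent parallel elements by simulation.

    Assumption: each element has a constant failure rate lam, so its time
    to failure is exponentially distributed.
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    favorable = 0
    for _ in range(n_runs):
        # One "run": draw a random time to failure for each element.
        t1 = rng.expovariate(lam)
        t2 = rng.expovariate(lam)
        # A parallel system survives to time t if either element survives.
        if max(t1, t2) > t:
            favorable += 1
    # Ratio of favorable outcomes to the number of runs N.
    return favorable / n_runs


def exact_parallel_reliability(lam, t):
    """Analytical result for the same hypothetical system, for comparison."""
    return 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)


lam, t = 1.0e-3, 1000.0  # hypothetical failure rate (per hour) and mission time
estimate = simulate_parallel_reliability(lam, t, n_runs=100_000)
exact = exact_parallel_reliability(lam, t)
print(f"simulated R = {estimate:.4f}, analytical R = {exact:.4f}")
```

The statistical error of the estimate shrinks roughly as one over the square root of N, which is why the size of N governs the running time.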

The methods introduced in the preceding paragraphs are discussed in the
remainder of this appendix. The next section, however, focuses on the customized
commercial programs commonly thought of as reliability and availability
modeling programs. All such programs start with a model-building phase that is
based on interactive or tabular input or, in more modern cases, an interactive
graphical editor. The next step is to choose from available component density
functions, including databases for some components, or to provide input of
failure-rate data; the program should then formulate the equations for the model
without user assistance. The next step is solution of the model equations, which
only requires information from the user regarding the points in time at which
reliability or availability values are required. The final phase is the output
section, which provides tabular and graphical output in addition to model
documentation as selected by the user. Most of the programs to be discussed run
on Windows 95, 98, 2000, or later versions. Some of these programs have
alternate versions that can be run on a Macintosh- or UNIX-based operating
system.


D2 VARIOUS TYPES OF RELIABILITY AND AVAILABILITY 
PROGRAMS 

D2.1 Part-Count Models 

Reliability and availability programs have been developed by universities,
government agencies, military contractors, and various commercial organizations.
Such programs generally can be grouped under a number of headings. The
simplest of such programs are those that implement a so-called part-count model.
Such a model assumes that all parts are vital to system operation; thus they are
all in series in a reliability sense. For such a model, the system failure rate is
the sum of the part failure rates, so programs of this type are essentially large
databases of part and component failure rates. The analyst starts with a parts
list, identifies each component, and enters the environmental parameters; the
program computes the failure rates based on the database and environmental
and other adjustment factors, or else the user inputs failure-rate parameters.
One of the most popular failure-rate databases was that contained in the
military handbook MIL-HDBK-217A, B, C, D, E, and F, published from 1962 to
1992. Thus many of the earlier part-count programs used these handbooks as
their databases and were sometimes called MIL-HDBK-217 programs. Newer
programs frequently use data collected by the telecommunications industry
[Bellcore, 1997; Shooman, 1990, p. 643].
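
A part-count computation is simple enough to sketch directly. In the Python fragment below, the parts list, quantities, and failure rates are hypothetical numbers invented for illustration (they do not come from MIL-HDBK-217 or any other database), and the conversion from failure rate to reliability assumes constant failure rates.

```python
import math

# Hypothetical parts list: (part name, quantity, failure rate per hour).
# These values are illustrative only, not handbook data.
PARTS = [
    ("microprocessor", 1, 2.0e-6),
    ("memory chip", 4, 0.5e-6),
    ("resistor", 20, 0.01e-6),
]


def system_failure_rate(parts):
    """Part-count model: all parts in series, so the failure rates add."""
    return sum(quantity * rate for _, quantity, rate in parts)


def reliability(failure_rate, t):
    """R(t) = exp(-lambda * t), assuming a constant failure rate."""
    return math.exp(-failure_rate * t)


lam_sys = system_failure_rate(PARTS)
mttf_hours = 1.0 / lam_sys  # MTTF for a constant failure rate
print(f"system failure rate = {lam_sys:.3e} per hour")
print(f"MTTF = {mttf_hours:.0f} hours, R(1000 h) = {reliability(lam_sys, 1000.0):.5f}")
```

A commercial part-count program adds the database lookup and the environmental adjustment factors; the summation itself is no more than what is shown here.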

D2.2 Reliability Block Diagram Models 

The part-count method assumes that all components are in series. This is often
not the case, however, and an improved model is needed for reliability
estimation. The simpler reliability block diagram (RBD) models consider elements
in various series and parallel combinations. The more general RBD programs
utilize cut-set or tie-set algorithms, such as those discussed in Section B2.7.
Sometimes, such models are combined with a failure-rate database; in other
cases, the analyst must input the set of required failure rates to the program.
The programs generally include a plotting option that graphs the system R
versus t. Generally, the program allows one to deal with discrete probabilities at
some point in time (often called demand probabilities). If the system contains
repair and the system is decoupled, then discrete availabilities can be used in
the model to compute steady-state availability.
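
As a minimal sketch of the simpler series-parallel RBD computation, the following Python fragment combines block probabilities with two helper functions. The helper names, the example system (two redundant units in parallel, in series with a third block), and all numerical values are assumptions made for illustration; independence of the blocks is also assumed. The same helpers accept either demand probabilities or, for a decoupled system with repair, steady-state availabilities.

```python
from functools import reduce


def series(*probs):
    """All blocks must work: multiply the success probabilities."""
    return reduce(lambda acc, p: acc * p, probs, 1.0)


def parallel(*probs):
    """At least one block must work: 1 minus the product of failure probabilities."""
    return 1.0 - reduce(lambda acc, p: acc * (1.0 - p), probs, 1.0)


def steady_state_availability(mttf, mttr):
    """Steady-state availability of one repairable block, MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)


# Hypothetical system: two redundant units in parallel, in series with a bus.
r_system = series(parallel(0.9, 0.9), 0.99)

# Same structure with availabilities substituted for reliabilities
# (valid only if the blocks are decoupled, i.e., repaired independently).
a_unit = steady_state_availability(mttf=1000.0, mttr=10.0)
a_system = series(parallel(a_unit, a_unit), a_unit)
print(f"R = {r_system:.4f}, steady-state A = {a_system:.6f}")
```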

D2.3 Reliability Fault Tree Models 

The fault tree (FT) method discussed in Section B2.5 and shown in Fig. B13
introduces an analysis technique that is a competitor to the RBD method.
System analysts and designers who focus on system success paths often favor the
RBD method. Many feel that it is easier to list modes of failure and build a
system model from these modes; this results in an FT model. Also, those who
perform safety analysis often claim that the FT method is more intuitive. In
any event, the two methods are mathematically equivalent [Shooman,
1970]. One can describe the RBD method as a probability of success viewpoint
and the FT method as a probability of failure viewpoint. The classic FT
modeling method is described in McCormick [1981, Chapter 8]; more recent work
on FT modeling can be found in Dugan [1996]. The exact analytical solution
of the FT and the RBD methods can both be based on cut sets or tie sets,
and approximations are frequently incorporated. Cut sets are generally used
because they represent failure combinations that have small probability in
reliable systems, and the omission of a cut set in error only has a small effect on
the computation of reliability or availability.
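
The equivalence of the success and failure viewpoints, and the usual cut-set approximation, can be illustrated with a small enumeration. The Python sketch below uses a hypothetical three-element system (elements a and b in parallel, in series with c) with made-up failure probabilities and assumes independent elements; it is not drawn from any of the cited programs. It evaluates the tie-set (success) and cut-set (failure) descriptions exactly by listing all element states, then compares the exact failure probability with the sum of the cut-set probabilities.

```python
import math
from itertools import product

# Hypothetical element failure probabilities (independence assumed).
Q = {"a": 0.1, "b": 0.1, "c": 0.01}
TIE_SETS = [{"a", "c"}, {"b", "c"}]  # success paths (RBD viewpoint)
CUT_SETS = [{"a", "b"}, {"c"}]       # failure combinations (FT viewpoint)


def state_probability(failed):
    """Probability of one system state, given the set of failed elements."""
    prob = 1.0
    for name, q in Q.items():
        prob *= q if name in failed else 1.0 - q
    return prob


def all_states():
    """Yield every combination of good and failed elements."""
    names = sorted(Q)
    for flags in product([False, True], repeat=len(names)):
        yield {name for name, is_failed in zip(names, flags) if is_failed}


# Success viewpoint: the system works if some tie set has no failed element.
reliability = sum(
    state_probability(s) for s in all_states()
    if any(not (tie & s) for tie in TIE_SETS)
)

# Failure viewpoint: the system fails if some cut set has entirely failed.
p_fail = sum(
    state_probability(s) for s in all_states()
    if any(cut <= s for cut in CUT_SETS)
)

# Common approximation: add the cut-set probabilities and ignore overlaps.
p_fail_approx = sum(math.prod(Q[name] for name in cut) for cut in CUT_SETS)

print(f"R = {reliability:.6f}, Pf = {p_fail:.6f}, R + Pf = {reliability + p_fail:.6f}")
print(f"sum of cut-set probabilities = {p_fail_approx:.6f}")
```

Because the cut-set probabilities are small in a reliable system, the simple sum is a close upper bound on the exact failure probability, which is the reason cut sets are usually preferred for approximations.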

D2.4 Markov Models 

If repair is involved in a system, the components are decoupled, and
steady-state availability solutions are satisfactory, the availability
probabilities can then be substituted into an RBD or an FT model. If the
components are not decoupled, discrete-state, continuous-time Markov models
would be the conventional approach. Such Markov models result in a set of
differential equations. The following are five approaches to solving such
models:

1. Solve the differential equations and obtain a closed-form time function
that can be plotted.

2. Solve the differential equations and obtain a numerical solution that can 
be plotted. 

3. Use Laplace transforms to help solve the differential equations and obtain 
a closed-form time function that can be plotted. 

4. Use Laplace transforms to help solve the differential equations and obtain 
a numerical solution that can be plotted. 

5. Solve only for the steady-state values, which reduces the differential
equations to a set of algebraic equations that can be solved more easily.

The analysis techniques discussed in Chapters 3 and 4 are based on a
combination of these five approaches. In many cases, the analytical approach is
all that is required. For large, complex problems, however, a computer
program may be required to check analytical approximations, to obtain a solution
in intractable complex cases, and to document the model solution. The
various Markov modeling programs allow one or more of these approaches, and
the simpler ones only use approach (5) for steady state. The more
comprehensive programs include graphical input programs for constructing a Markov
state model to define the problem and provide facilities for printing a copy of
the Markov diagram to document the model. Sometimes the Markov model
is built around a simulation program that allows great flexibility in modeling
the repair process. A discussion of the use of one program to perform Markov
modeling is given in Section D5.
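
To make approaches (1), (2), and (5) concrete, the following Python sketch treats the smallest possible repairable model: a single element with an up state and a down state. The failure rate, repair rate, step count, and function names are hypothetical choices for illustration, and the fourth-order Runge-Kutta integrator is just one of many numerical methods a Markov program might use.

```python
import math


def derivatives(p, lam, mu):
    """Markov equations for one repairable element: p = (P_up, P_down)."""
    p_up, p_down = p
    return (-lam * p_up + mu * p_down, lam * p_up - mu * p_down)


def rk4_availability(lam, mu, t_end, steps=1000):
    """Approach (2): numerical solution by fourth-order Runge-Kutta."""
    h = t_end / steps
    p = (1.0, 0.0)  # the element starts in the up state
    for _ in range(steps):
        k1 = derivatives(p, lam, mu)
        k2 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k1)), lam, mu)
        k3 = derivatives(tuple(x + 0.5 * h * k for x, k in zip(p, k2)), lam, mu)
        k4 = derivatives(tuple(x + h * k for x, k in zip(p, k3)), lam, mu)
        p = tuple(
            x + h * (a + 2.0 * b + 2.0 * c + d) / 6.0
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p[0]


def closed_form_availability(lam, mu, t):
    """Approach (1): closed-form time function for the same two-state model."""
    total = lam + mu
    return mu / total + (lam / total) * math.exp(-total * t)


def steady_state_availability(lam, mu):
    """Approach (5): set the derivatives to zero and solve algebraically."""
    return mu / (lam + mu)


lam, mu = 0.01, 0.1  # hypothetical failure and repair rates per hour
a_numeric = rk4_availability(lam, mu, t_end=20.0)
a_closed = closed_form_availability(lam, mu, 20.0)
a_steady = steady_state_availability(lam, mu)
print(f"A(20) numeric = {a_numeric:.8f}, closed form = {a_closed:.8f}")
print(f"steady state  = {a_steady:.8f}")
```

For a model this small, all three approaches are trivial; the value of a program lies in applying the same steps to models with many states.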

D2.5 Mathematical Software Systems: Mathcad, Mathematica, and 
Maple 

The vast power of the digital computer stimulated two areas in numerical
computation. The most obvious was the consolidation and improvement of the
many algorithms that were available for computing roots of polynomials,
solving systems of algebraic equations and differential equations, and so on. In
addition, much research was done on the symbolic solution of expressions; for
example, integration, differentiation, the closed-form solution of differential
equations, and the factoring of polynomials. These developments culminated
in a wide variety of mathematical packages that are very helpful in many
analysis and solution tasks. Some of the leading programs are Mathematica [1999],
Mathcad [1995], Matlab [1992], Macsyma [Ralston, 1976], and Maple [Ellis,
1992]. There is a great amount of overlap among these programs, for which
reason the initial comparison should be based on whether the program supports
symbolic manipulation. The choice of which program to use may be based on
specific features as well as availability and prior experience with a particular
program at work. However, if the program must be acquired, its flexibility and
ease of use in addition to one’s confidence in its accuracy and validity should
all be the major deciding factors. The use of Maple to check analytical solutions
is discussed in Section D5. A good discussion of some of these programs and
their origins appears in Ralston [1976, pp. 35-46].

At this point, a table comparing the features and prices of such programs as
well as the availability of test copies is in order. However, because the factors
change rapidly and all of this information is available on the Web, readers are
urged to contact the manufacturers to make their own comparison. To facilitate
such a search, contact information for the programs is provided in Table D1.

TABLE D1 Information on Mathematical Programs

Product      Company Name and Address     Telephone        Web Address
Name                                      Number
-----------  ---------------------------  ---------------  -----------------
Mathematica  Wolfram Research, Inc.       (217) 398-0700   www.wolfram.com
             100 Trade Center Drive
             Champaign, IL 61820
Mathcad      MathSoft, Inc.               (617) 577-1017   www.mathsoft.com
             101 Main Street
             Cambridge, MA 02142
Matlab       The Math Works               (508) 647-7000   www.mathworks.com
             3 Apple Hill Drive
             Natick, MA 01760
Macsyma      Symbolics Technology, Inc.   -                www.symbolics.com
Maple        Waterloo Maple, Inc.         (800) 267-6583   www.maplesoft.com

D2.6 Fault-Tolerant Computing Programs 

In the mid-1970s, researchers in the fault-tolerant computer field began to
develop specialized reliability and availability programs. These programs
incorporated a general reliability computation program and added some
features of special interest to the fault-tolerant field, such as coverage and
transient faults. Some of the first such programs were developed by
Professor Algirdas Avizienis and his students at UCLA (ARIES ’76 and ARIES
’82) [Markam, 1982]. Several other programs (ASSIST, CARE, HARP, and
SHURE) were developed soon after ARIES by various researchers at NASA’s
Langley Research Center. (For a description of HARP and CARE, see Bavuso
[1988]; for ASSIST and SHURE, see Johnson [1988].) Some of the more
recent fault-tolerant programs (e.g., SHARPE) were developed by Professor
Kishor Trivedi and his students at Duke University [Sahner, 1987, 1995].

D2.7 Risk Analysis Programs

Risk analysis programs represent another class of large, comprehensive
reliability and availability analysis programs. The term risk analysis generally
implies that the analysis includes the consequences of failure [Shooman, 1990,
Appendix J]. The impetus for such programs was the first probabilistic risk
analysis for a nuclear power reactor performed for the U.S. Nuclear
Regulatory Commission (NRC) by a team of experts led by Professor Norman
Rasmussen of MIT [McCormick, 1981, p. 240; NRC, 1975]. Risk analysis
generally includes a final stage that predicts the probability of various classes of
failures (accidents, calamities) and the result of such accidents (property loss,
injuries, deaths). In addition, the team that conducted the NRC study (known
in familiar terms as WASH-1400) found that it was difficult to include all the
possible effects with such a large, complex system as a nuclear power
reactor by using reliability block diagrams and fault trees. New techniques called
event trees and event sequence diagrams were developed to help in the
analysis, along with fault tree methods [McCormick, 1981, p. 193]. Such programs
have been developed for analysis of nuclear reactors and other similar risk
situations [McCormick, 1981, Chapter 10].

Several risk analysis programs have evolved over the past few
decades, namely: SAPHIRE [Long, 1999]; RISKMAN [Wakefield]; NUPRA,
REBECCA, and CAFTA/ETA [Smith]. Presently, NASA Headquarters is
developing a comprehensive risk analysis program called QRAS for its space
projects [Satie, 1998; Shooman, 2000]. The analyst must judge whether one
of these risk programs is suitable for fault-tolerant studies.


D2.8 Software Reliability Programs 

Software reliability models and the supporting programs differ from those of
hardware reliability. Many programs have been developed by researchers and
companies; however, three multimodel programs exist: SMERFS, CASRE, and
SoRel. The best description and comparison of software reliability modeling
programs is in Appendix A of the Handbook of Software Reliability
Engineering [Stark, 1996]. The SMERFS program was developed in 1983 by the U.S.
Naval Surface Warfare Center in Dahlgren, Virginia. In 1991, the LAAS
Laboratory at the National Center for Scientific Research in Toulouse, France,
developed SoRel. The Jet Propulsion Laboratory developed CASRE in 1993. For
further details, see Stark [1996].


D3 TESTING PROGRAMS 

The development of large, comprehensive reliability and availability programs
requires the technical skills of reliability analysts who know probability and
reliability methods and those of skilled programmers who can translate the
algorithms into code. Seldom do these skills reside in the same people;
generally, the reliability analysts explain the algorithms and the programmers code
the program. There is often insufficient coordination and review between the
two groups, resulting in user-unfriendly interfaces or errors in the algorithms.
The user should not expect a polished program such as a commercial word
processor or a spreadsheet program, where the millions of users eventually
report all the major bugs that are fixed in later releases.

Figure D1 Fault tree for testing reliability programs.

The user should test any new program by comparing the solutions with those
obtained with prior programs. The author has found that some of these
programs do not make an exact computation when elementary events are repeated
in a fault tree. Instead, they use an approximation that is generally (but not
always) valid. One should test any program to see if it properly computes the
fault tree given in Fig. D1.

The example given in Fig. D1 fails if a and b fail, if a and c and d fail, or if
e and f fail; thus the cut sets for the example are a'b', a'c'd', and e'f'. The
correct expression for the probability of failure is given by Eq. (D1a).

$$
\begin{aligned}
P_f = P(a'b' + a'c'd' + e'f') ={}& P(a'b') + P(a'c'd') + P(e'f') \\
& - P(a'b'a'c'd') - P(a'b'e'f') - P(a'c'd'e'f') \\
& + P(a'b'a'c'd'e'f')
\end{aligned}
\tag{D1a}
$$

The intersection of two like elements obeys the following logic law: $x \cdot x = x$.
Thus Eq. (D1a) becomes

$$
\begin{aligned}
P_f = P(a'b' + a'c'd' + e'f') ={}& P(a'b') + P(a'c'd') + P(e'f') \\
& - P(a'b'c'd') - P(a'b'e'f') - P(a'c'd'e'f') \\
& + P(a'b'c'd'e'f')
\end{aligned}
\tag{D1b}
$$

If all the elements are independent and have a probability of failure of q, that
is, P(a') = P(b') = P(c') = P(d') = P(e') = P(f') = q, the probability of failure
becomes

$$
P_f = 2q^2 + q^3 - 2q^4 - q^5 + q^6
\tag{D1c}
$$

Many programs do not perform the step given in Eq. (D1b); rather, they expand
Eq. (D1a) as

$$
P_f = 2q^2 + q^3 - q^4 - 2q^5 + q^7
\tag{D1d}
$$

Equations (D1c) and (D1d) have the same first two terms, but the following
three terms differ. If q is small, the higher-order powers of q are negligible,
and the two expressions give approximately the same numerical result. If q is
not small, however, the two expressions differ. If q = 0.2, Eq. (D1c) gives
P_f = 0.084544, and Eq. (D1d) yields 0.0857728. Larger values of q result in
an even larger difference; thus caveat emptor! (“let the buyer beware”).
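
The test just described is easy to reproduce. The Python sketch below is an illustration written for this discussion, not the algorithm of any particular commercial program: it expands the probability of the union of the three cut sets by inclusion-exclusion, once merging repeated events (so that the repeated event a' is counted only once in any product) and once multiplying them again as a faulty program would. The function and variable names are hypothetical, and independent elements with a common failure probability q are assumed, as in the text.

```python
from itertools import combinations

# Cut sets of the test fault tree: a'b', a'c'd', and e'f'.
CUT_SETS = [frozenset("ab"), frozenset("acd"), frozenset("ef")]


def failure_probability(q, merge_repeated_events):
    """Inclusion-exclusion expansion of P(union of cut sets).

    Every element is assumed independent with failure probability q.
    """
    total = 0.0
    for k in range(1, len(CUT_SETS) + 1):
        sign = 1.0 if k % 2 == 1 else -1.0
        for group in combinations(CUT_SETS, k):
            if merge_repeated_events:
                # Correct: x . x = x, so a repeated event is counted once.
                n_events = len(frozenset().union(*group))
            else:
                # Faulty: a repeated event contributes an extra factor of q.
                n_events = sum(len(cut) for cut in group)
            total += sign * q ** n_events
    return total


q = 0.2
pf_exact = failure_probability(q, merge_repeated_events=True)
pf_faulty = failure_probability(q, merge_repeated_events=False)
print(f"exact  Pf = {pf_exact:.7f}")
print(f"faulty Pf = {pf_faulty:.7f}")
```

With q = 0.2 the two results agree, to within round-off, with the values 0.084544 and 0.0857728 quoted above, so the sketch can serve as an independent reference when testing a fault tree program on this example.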


D4 PARTIAL LIST OF RELIABILITY AND AVAILABILITY 
PROGRAMS 

Many reliability and availability programs exist, all varying greatly in their
ease of use, facilities provided, and cost. The purchaser should be wary of the
basic validity of some of these programs, as was discussed in the preceding
section. The contact information provided in Table D2 should allow users to
conduct their own search and comparison via the Web.

Additional reliability and availability software advertisements can be found
in the end pages of the Proceedings Annual Reliability and Maintainability
Symposium. Sometimes, specialized reliability programs appear in the
literature, such as the NAVY TIGER program [Luetjen, 1982], which was designed
to analyze reliability and availability of naval ships and incorporates certain
preventive maintenance programs used by the U.S. Navy.


D5 AN EXAMPLE OF COMPUTER ANALYSIS

As part of a consulting assignment, the author was asked to derive a
closed-form analytical solution for a spacecraft system with one on-line element and
two different standby elements with dormancy failure rates. By dormancy
failure rates, one means that the two standby elements have small but nonzero
failure rates while on standby. A full Markov model for the three elements
would require eight states, resulting in eight differential equations. Normally,
one would use a numerical solution; however, the company staff for whom the
author was consulting wished to include the solution in a proposal and felt that
a closed-form solution would be more impressive and that it had to be checked
for validity. (Errors had been found in previous company derivations.)
Assuming that the two standby elements had identical on-line and standby failure
rates allowed a reduction to a six-state model. Formulation of the six
equations, computing the Laplace transforms, and checking the resulting
pencil-and-paper equations and solutions took the author a day while he worked with
one of the company’s engineers.

To check the results, the six basic differential equations were submitted in
algebraic form to the Maple symbolic equation program, and an algebraic
solution was requested. The first four of the state probabilities were easily checked,
but the fifth equation took about half a page in printed form and was difficult
to check. The Maple program provided a factoring function; when it was asked
to factor the equation, another form was printed. Careful checking showed that
the second form and the pencil-and-paper solution were identical. The last
(sixth) equation was the most complex, for which the Maple solution produced
an algebraic form with many terms that covered more than a page. Even after
using the Maple factoring function, it was not possible to show that the two
equations were identical. As an alternative, the numerical values of the failure
rates were substituted into the pencil-and-paper solution and numerical values
were obtained. Failure rates were substituted into the Maple equations, and
the program was asked for numerical solutions of the differential equations.
These numerical solutions were identical (within round-off error to many
decimal places) and easily checked.

There are several lessons to be learned from this discussion. The Maple
symbolic equation program is very useful in checking solutions. However, as
problems become larger, numerical solutions may be required, though it is
possible that newer versions of Maple or some of the other symbolic programs
may be easier to use with large problems. Checking an analytical solution is a
good way of ensuring the accuracy of your results. Even in a very large
problem, it is common to make a simplified model that could be checked in this
way. Because of potential errors in modeling or in computational programs, it
is wise to check all results in two ways: (a) by using two different modeling
programs, or (b) by using an analytical solution (sometimes an approximate
solution) as well as a modeling program.
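
The idea of checking a closed-form result against an independent numerical computation can be shown on a much smaller model than the six-state one described above. The Python sketch below is an assumption-laden simplification, not the author's spacecraft model: it uses only one on-line element (failure rate lam) and one identical standby element with a dormancy failure rate lam_s, with perfect switching assumed. The closed-form reliability is compared with a numerical evaluation of the corresponding convolution integral by Simpson's rule, and with the known parallel-system result obtained when the standby element fails at the full on-line rate.

```python
import math


def closed_form_reliability(lam, lam_s, t):
    """Closed-form R(t): one on-line element plus one dormant standby."""
    both_good = math.exp(-(lam + lam_s) * t)
    one_good = (lam + lam_s) / lam_s * (math.exp(-lam * t) - both_good)
    return both_good + one_good


def numerical_reliability(lam, lam_s, t, intervals=1000):
    """Independent check: Simpson's rule on the convolution integral.

    The integrand is the density of leaving the both-good state at time tau
    multiplied by the probability that the remaining element survives the
    rest of the mission. The number of intervals must be even.
    """
    def integrand(tau):
        leave_first_state = (lam + lam_s) * math.exp(-(lam + lam_s) * tau)
        survive_alone = math.exp(-lam * (t - tau))
        return leave_first_state * survive_alone

    h = t / intervals
    total = integrand(0.0) + integrand(t)
    for i in range(1, intervals):
        total += (4.0 if i % 2 else 2.0) * integrand(i * h)
    return math.exp(-(lam + lam_s) * t) + total * h / 3.0


lam, lam_s, t = 1.0e-3, 1.0e-4, 1000.0  # hypothetical rates (per hour) and mission time
r_closed = closed_form_reliability(lam, lam_s, t)
r_numeric = numerical_reliability(lam, lam_s, t)

# Limiting-case check: a standby element that fails at the full on-line rate
# behaves like an ordinary parallel element.
r_limit = closed_form_reliability(lam, lam, t)
r_parallel = 2.0 * math.exp(-lam * t) - math.exp(-2.0 * lam * t)

print(f"closed form = {r_closed:.10f}, numerical = {r_numeric:.10f}")
print(f"lam_s = lam: {r_limit:.10f}, parallel system: {r_parallel:.10f}")
```

Agreement between the two computations does not prove the model is right, but a disagreement immediately signals an error in either the derivation or the program.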


PROBLEMS

D1. Search the Web for reliability and availability analysis programs. Make
a table comparing the type of program, the platforms supported, and the
cost.

D2. Use a reliability analysis program to compute the reliability for the first
three systems in Table 7.8 and check the reliability.

D3. Use a symbolic modeling program to check Eq. (3.56).

D4. Use a Markov modeling program to check the results given in Eq. (3.58).

D5. Use a fault tree program to solve the model of Fig. D1 to see if the results
agree with Eq. (D1c) or Eq. (D1d).
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Tandem, 126, 127, 183 
Univac, 146, 147 
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Burst, 62 
code, decoder, 65 
decoder failure, 73-75 
encoder, 65 
errors, 32, 62 
properties, 64, 65 
Reed-Solomon, 72-75, 126 

CAID, 119 (see also RAID) 

Chinese Remainder Theorem, 66-71 
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Cluster, of computers, 135, 136 

Coding methods, burst codes (see Burst) 
check bits, 35 
cryptanalysis, 29 (1.25), 30 
error-correcting, 2, 31 
error-detecting, 2, 31 
errors, 32, 33 

Hamming codes, 31, 44-47, 54 
Hamming distance, 34, 45, 46 
other codes, 45, 75, 76 
parity-bit codes, 35, 37 
coder (encoder, generator), 37, 38, 40, 

42 

coder-decoder failures, 43, 53-59 
decoder (checker), 37, 38, 40, 42 
probability of undetected errors, 32, 
39^12, 45, 52-53, 59-62 
RAID levels 2-6, 121-126 
Reed-Solomon codes (see Burst) 
reliability models (see also Probability of 
undetected errors) 
retransmission codes, 59-62 
single error-detecting and double error- 
" detecting (SECDED), 47, 51-52 
single error-correcting and single error- 
" detecting (SECSED), 47-51 
soft fails, 33 

Cold redundancy, cold standby (see Standby 
systems) 

Computer, CDC 6600, 11 
ENIAC, 4 
history, 4, 5 
Mark I, 4 

Conditional reliability, 390, 391 

Coverage, 115, 117 

Cryptography (see Coding methods, 
cryptanalysis) 

Cut-set methods (see also Network reliability, 
definition; Reliability modeling) 

Dependent failures (Common mode failures), 
(see Reliability theory, combinatorial 
reliability) 

Downtime, 14, 134 (see also Availability) 

EMC (see RAID) 

ESS (see Availability, typical computer 
systems) 

Fault-tolerant computing, calculations, 12, 13 
definition, 1 

Furby, 9 


Global Positioning System (GPS), 195, 

280 (5.6) 

Hamming codes (see Coding methods) 
Hamming distance (see Coding methods) 
Hazard (see also Reliability modeling, failure 
rate), derivation, 222-224 
function of system, 94, 95 
Himalaya computers (see Tandem) 

Hot redundancy, hot standby (see Parallel 
systems) 

Human operator reliability, 202 

Laplace transforms, as an aid in computing 
MTTF, 169, 170, 174, 175, 468, 469 
definition, 462^-64 
of derivatives, 465, 466 
final value computation, 170, 182 
initial value approximation, 469^-71 
initial value computation, 173, 174 
of Markov models, 93 
partial fractions, 466, 467 
table of theorems, 468 
table of transforms, 465 
Library of Congress, 10, 27 (1.1), (1.3), 
(1.19) 
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Probability) 
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collapsed graphs (see merger) 
complexity, 453, 454, 461 
decoupling (see uncoupled) 
formulation, 104-108, 112-117, 446-450 
graphs, 450 

Laplace transforms, 461-468 
merger of states, 166, 453, 454 
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solution of Markov equations, 106, 108, 
115-117, 118, 166-179 
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Poisson process, 404-407 
process, 404 
properties, 403, 404 
transition matrix, 407, 408 
two-element model, 450-453 
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Mean time between failure (MTBF) (see 
Mean time to failure) 

Mean time to failure (MTTF), 95, 96, 114, 
115, 117, 140 (3.16), 169, 170, 174 
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constant-failure rate (hazard), 224, 225 
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linearly increasing hazard, 225 
RAID, 120, 123, 125 
tables of, 115, 117 
TMR, 151-153 

Mean time to repair (MTTR), 112-119, 126, 
127 
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Microsoft, 5 

MIL-HDBK-217, 79 (2.7), 427, 506 
Moore’s Law, 5-8 

NASA, 

Apollo, 194 

Space Shuttle, 188, 194, 266-269 
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283-285 

ARPA network, 312 
availability, 286-288 
computer solutions, 308, 309 
definition, 285, 288 
all-terminal, 286 

cut-set and tie-set methods, 303-305 
event-space, 302, 303 
graph transformations, 305-308 
^-terminal, 286, 308 
two-terminal, 286, 288-301 

cut-set and tie-set methods, 292-294 
graph transformations, 297-301 
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state-space, 288-292 
subset approximations, 296, 297 
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design approaches, 309-321 
adjacency matrices, 312, 313 
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310-312 

enhancement phase, 318-321 
Hamiltonian tours, 317, 327 (6.14), 

328 (6.15)-(6.17) 
incidence matrices, 312, 313 
Kruskal’s and Prim’s algorithms, 312, 
314-318 

spanning trees, 314-318 
graph models, 284, 285 
A-modular redundancy, 2, 145, 146, 153-161 
history, 146, 147 
repair, 165-183, 454^161 
triple modular redundancy (TMR), 147, 
148, 149-153, 

comparison with parallel and standby 


systems 178, 179 
Markov models, 166-170 
MTTF, 151-153 
voter logic, 161-165 
adaptive voting, 194 
adjudicator algorithms, 189-195 
comparison of reliability, 193 
consensus voting, 190-192 
pairwise comparison, 191, 193 
test and switch, 191 
voters, 154-161 

voting with lockout, 186, 188, 189 
NMR (see A-modular redundancy) 

A-version (see Software reliability) 

Parallel systems, 2, 83, 97-99, 104 (see also 
Reliability optimization) 
comparison with standby, 108-111, 178, 

179 

MTTF, 96, 114, 115 
Polynomial roots, 165, 166 
Probability, complement, 388 
conditional, 390-391 
continuous random variables, 395-401 
density and distribution function, 

395-397 

exponential distribution, 397-399, 403, 
433, 434 

Normal (Gaussian) distribution, 398, 400, 
401, 403 

Rayleigh distribution, 398, 399, 403, 434 
rectangular (uniform) distribution, 397, 
398 

Weibull distribution, 398, 399, 403, 
434^138 

discrete random variables, 391-395 
binomial (Bernoulli) distribution, 393, 
394, 403 

density function, 391, 392 
distribution function, 392, 393 
Poisson distribution, 185, 395, 396, 
403^107 

Markov models (see Markov models) 
moments, 401-403 
expected value, 401, 402 
mean 402, 403 
variance, 403 

Probability of undetected error (see Coding 
methods) 

RAID, Advisory Board, 120 
EMC Symmetrix, 10, 27 (1.1) 
levels, 121-126 
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mirrored disks, 122 
reliability, 119-126 
stripping, 125 

RBD (see Reliability modeling) 

Redundancy (see also Parallel systems. 
Reliability optimization), 
component, 86-92 
couplers, 91, 92 
system, 86-92 

Reliability allocation (see Reliability 
optimization) 

Reliability analysis programs 
example, 514 

fault-tolerant computing programs, ARIES, 
509 

ASSIST, 509 
CARE, 509 
HARP, 509 
SHAPE, 509 
SHURE, 509 

mathematics packages, Macsyma, 256, 505, 
508, 509 

Maple, 256, 505, 508, 509 
Mathcad, 256, 505, 508, 509 
Mathematica, 256, 505, 508, 509 
Matlab, 505, 508, 509 
partial list, 512, 513 
risk analysis, CAFTA, 510 
NUPRA, 510 
QRAS, 510 
REBECCA, 510 
RISKMAN, 510 
SAPHIRE, 510 

software reliability (see Software reliability, 
programs) 

testing programs, 510-512 
Reliability modeling (see also Reliability 
theory), 

block diagram (RBD), 413, 444 
cut-set methods, 292-294, 419, 420 
density function, 218-221 
distribution function, 218-221 
event-space, 288-292 
failure rate, 222-224 (see also Hazard) 
graph (see block diagram) 
probability of success, 219-221 
reliability function, 218-221 
system, example, auto-brake system, 
442-446 

parallel, 440, 441 
r-out-of-n structure, 441, 442 
series, 438-440 
theory, 218-221 


tie-set methods, 292-294, 419, 420 
Reliability optimization, algorithms, 359-365 
apportionment, 85, 86, 342-349, 366, 367 
Albert's method, 345-349 
availability, 349-351 
equal weighting, 343 
relative difficulty, 344, 345 
relative failure rates, 345 
communication system, 383 (7.31) 
concepts, 11, 12, 85, 86, 332-334 
decomposition, 337-340 (see also Software 
development, graph model) 
definition, 2, 4, 334-336 
dynamic programming, 371-379 
greedy algorithm, 369-371 
interfaces, 340 
minimum bounds, 341, 342 
multiple constraints, 365, 366 
parallel redundancy, 336 
redundant components, 336 
subsystem, 340-342 
bounded enumeration, 353-359 
lower bounds (minimum system 
design), 354-357 

upper bounds (augmentation policy), 
358, 359 

exhaustive enumeration, 351-353 
series system, 335 
standby redundancy, 336, 337 
standby system, 367-369 
Reliability theory (see also Reliability 
modeling) 

combinatorial reliability, 412, 413 
exponential expansions, 92-94 
parallel configuration, 415, 416 
r-out-of-n configuration, 416, 417 
series configuration, 413-415 
common mode effects, 99-101 
cut-set and tie-set methods, 419, 420 
failure mode and effect analysis (FMEA), 
418, 419, 443 

failure-rate models, 421-429 
density function, 422-425, 429-431 
distribution function, 423-425 
failure data, 421-425 
bathtub curve, 425, 426 
handbook, 425-427 
integrated circuits, 427-429 
hazard function, 422-424, 432-438 
reliability function, 423-425 
fault-tree analysis (FTA), 418, 445 
history, 411, 412 

reliability block diagram (RBD), 413 
reliability graph, 413 
Repairable systems, 111-117 
availability function, 111, 117-119 
reliability function, 111 
single-element Markov models, 115 
two-element Markov models, 112, 115, 116 
Redundancy (see parallel systems) 
couplers, 91, 92 
microcode-level, 186 

Rollback and recovery (recovery block), 191, 
203, 268-275 
checkpointing, 274, 275 
distributed storage, 275 
journaling, 272, 273 
rebooting, 270, 271 
recovery, 271, 272 
retry, 273, 274 
r-out-of-n system, 101-104 

SABRE, 135 

SECDED (see Coding methods) 

Software development, 203, 205 
build, 218, 221 
coding, 208, 214, 215 
error, 225-227 (see also Software 
reliability, error models) 
fault, 225, 226 
graph model, 211-214 
hierarchy diagram (H-diagram), 211-214 
life cycle, 207-218 
deployment, 208 
design, 208, 211-214 
incremental model, 221 
maintenance, 208 
needs document, 207, 208 
object-oriented programming (OOP), 207 
phases, 208 
pseudocode, 226 

rapid prototype model, 208, 210, 220 
redesign, 208, 218 
requirements document, 208, 209 
specifications document, 208-210 
structured procedural programming 
(SPP), 207, 215 
warranty, 218 
waterfall model, 219 
process diagrams, 218-221 
reused code (legacy code), 210 
source lines of code (SLOC), 210, 211, 214 
testing, 208, 215-218 
Software engineering (see Software 
development) 

Software Engineering Institute, 268 


Software fail-safe, 203 
Software redundancy, 262 
Software reliability, 203, 204 
data, error, 203, 225-227 
generation, 227-229 
models, 225-236 
removal, 227-229 

constant-rate, 230-232 
exponentially decreasing rate, 234-236 
linearly decreasing rate, 232-234 
S-shaped, 235, 236 
hardware, operator, software, 202 
independence, 202 
macro models, 262 

mean time to failure (MTTF), 238-241, 
245-246 
models, 237-250 
Bayesian, 261 
comparison, 249-250 
constant error-removal-rate, 238-242 
exponentially decreasing error-removal 
rate, 246-248 

linearly decreasing error-removal rate, 
242-246 

model-constant estimation, 250-258 
from development test data, 260 
handbook estimation, 250-252 
least-squares estimation, 256, 257 
maximum-likelihood estimation, 
257-258 

moment estimation, 252-256 
other models, 258-262 
N-version programming, 263-268 
programs, CASRE, 258 
Markov models, 507, 508 
reliability block diagram, 507 
reliability fault tree models, 507 
reliable software, 203 
SMERFS, 258 
software development, 205 
SoRel, 258 

Space Shuttle (see NASA) 

Standby systems, 83, 104 
comparison with parallel, 108-111, 178, 179 

redundancy, 2 
Storage errors, CD, 62 
CD-ROM, 62 
STRATUS, 122, 131-135 
availability, 134, 135 
Continuum, 134 
Stuck-at-one, 147 
Stuck-at-zero, 147 
Sun, 136, 137 
Syndrome, 51-56, 66 

Tandem, 122, 126-131, 136 
Guardian, 127 
Himalaya, 126, 129 
Technology timeline, 4 
Telephone switching systems, 15, 16 
Three-state elements, 92 
Tie-set methods (see Reliability modeling; 
Network reliability, definition) 


Triple modular redundancy (TMR) (see 
N-modular redundancy) 

Undetected errors, 32 

Uptime, 14, 134 (see also Availability) 

VAX, 136 

Voting (see N-modular redundancy) 
Year 2000 Problem (Y2K), 205-208 